Abstract

PAC-Bayes bounds are among the most accurate generalization bounds for classifiers
learned from independently and identically distributed (IID) data, and this is particularly
so for margin classifiers: recent contributions have shown how practical these bounds can
be, either to perform model selection (Ambroladze et al., 2007) or even to directly guide
the learning of linear classifiers (Germain et al., 2009). However, there are
many practical situations where the training data show some dependencies and where the
traditional IID assumption does not hold. Stating generalization bounds for such frameworks
is therefore of the utmost interest, both from theoretical and practical standpoints.
In this work, we propose what are, to the best of our knowledge, the first PAC-Bayes
generalization bounds for classifiers trained on data exhibiting interdependencies. The
approach undertaken to establish our results is based on the decomposition of a so-called
dependency graph, which encodes the dependencies within the data, into sets of independent
data, thanks to graph fractional covers. Our bounds are very general, since being able to
find an upper bound on the fractional chromatic number of the dependency graph is sufficient
to get new PAC-Bayes bounds for specific settings. We show how our results can be used to
derive bounds for ranking statistics (such as AUC) and classifiers trained on data
distributed according to a stationary $\beta$-mixing process. Along the way, we show how our
approach seamlessly allows us to deal with U-processes. As a side note, we also provide a
PAC-Bayes generalization bound for classifiers learned on data from stationary
$\varphi$-mixing distributions.

Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes. 



1. Introduction 



1.1 Background

Recently, there has been much progress in the field of generalization bounds for classifiers,
the most noticeable of which are Rademacher-complexity-based bounds (Bartlett and Mendelson,
2002; Bartlett et al., 2005), stability-based bounds (Bousquet and Elisseeff, 2002) and
PAC-Bayes bounds (McAllester, 1999). PAC-Bayes bounds, introduced by McAllester (1999)
and refined on several occasions (Seeger, 2002a; Langford, 2005; Audibert and Bousquet,
2007), are some of the most appealing advances from the tightness and accuracy points of
view (an excellent monograph on the PAC-Bayesian framework is that of Catoni (2007)).
Among others, striking results have been obtained concerning PAC-Bayes bounds for linear
classifiers: Ambroladze et al. (2007) showed that PAC-Bayes bounds are a viable route
to do actual model selection; Germain et al. (2009) recently proposed to learn linear
classifiers by directly minimizing the linear PAC-Bayes bound with conclusive results, while
Langford and Shawe-Taylor (2002) showed that under some margin assumption, the PAC-Bayes
framework allows one to tightly bound not only the risk of the stochastic Gibbs
classifier (see below) but also the risk of the Bayes classifier. The variety of (algorithmic,
theoretical, practical) outcomes that can be expected from original contributions in the
PAC-Bayesian setting explains and justifies the increasing interest it generates.



1.2 Contribution 

To the best of our knowledge, PAC-Bayes bounds have essentially been derived for the
setting where the training data are independently and identically distributed (IID). Yet,
being able to learn from non-IID data while having strong theoretical guarantees on the
generalization properties of the learned classifier is an actual problem in a number of
real-world applications such as bipartite ranking (and more generally $k$-partite ranking)
or classification from sequential data. Here, we propose the first PAC-Bayes bounds for
classifiers trained on non-IID data; they constitute a generalization of the IID PAC-Bayes
bound and they are generic enough to provide a principled way to establish generalization
bounds for a number of non-IID settings. To establish these bounds, we make use of simple
tools from probability theory and of convexity properties of some functions, and we exploit
the notion of fractional covers of graphs (Schreinerman and Ullman, 1997). One way to get a
high-level view of our contribution is the following: fractional covers allow us to cope with
the dependencies within the set of random variables at hand by providing a strategy to
make (large) subsets of independent random variables on which the usual IID PAC-Bayes
bound is applied. Note that we essentially provide bounds for the case of identically and
non-independently distributed data; the additional results that we give in the appendix
generalize to non-identically and non-independently distributed data.



1.3 Related Results 

We would like to mention that the idea of dealing with sums of interdependent random
variables by separating them into subsets of independent variables to establish concentration
inequalities dates back to the work of Hoeffding (1948, 1963) on U-statistics. Explicitly
using the notion of (fractional) covers, or equivalently colorings, of graphs to derive such
concentration inequalities has been proposed by Pemmaraju (2001) and Janson (2004) and
later extended by Usunier et al. (2006) to deal with functions that are different from the
sum. Just as Usunier et al. (2006), who used their concentration inequality to provide
generalization bounds based on the fractional Rademacher complexity, we take the approach of
decomposing a set of dependent random variables into subsets of independent random variables
a step beyond establishing concentration inequalities, to provide what we call chromatic
PAC-Bayes generalization bounds.

The genericity of our bounds is illustrated in several ways. It allows us to derive
generalization bounds on the ranking performance of scoring/ranking functions using two
different performance measures, among which the Area under the ROC curve (AUC). These
bounds are directly related to the work of Agarwal et al. (2005), Agarwal and Niyogi (2009),
Clémençon et al. (2008) and Freund et al. (2003). Even if our bounds are obtained as simple
specific instances of our generic PAC-Bayes bounds, they exhibit interesting peculiarities.
Compared with the bounds of Agarwal et al. (2005) and Freund et al. (2003), our AUC
bound depends less strongly on the skew (i.e. the imbalance between positive and
negative data) of the distribution; besides, it does not rest on (rank-)shatter
coefficients/VC dimension that may sometimes be hard to assess accurately; in addition, our
bound directly applies to (kernel-based) linear classifiers. Agarwal and Niyogi (2009) base
their analysis of ranking performances on algorithmic stability, and the qualitative
comparison of their bounds and ours is not straightforward because stability arguments are
somewhat different from the arguments used for PAC-Bayes bounds (and other uniform bounds).
As already observed by Janson (2004), coloring provides a way to generalize large deviation
results based on U-statistics; this observation carries over when generalization bounds are
considered, which allows us to draw a connection between the results we obtain and those of
Clémençon et al. (2008).

Another illustration of the genericity of our approach deals with mixing processes. In
particular, we show how our chromatic bounds can be used to easily derive new generalization
bounds for $\beta$-mixing processes. Rademacher-complexity-based bounds for this
type of process have recently been established by Mohri and Rostamizadeh (2009). To
the best of our knowledge, it is the first time that such a bound is given in the PAC-Bayes
framework. The striking feature is that it is done at a very low price: the independent
block method proposed by Yu (1994) directly gives a dependency graph whose chromatic
number is straightforward to compute. As we shall see, this suffices to instantiate our
chromatic bounds, which, after simple calculations, lead to an appropriate generalization
bound. For the sake of completeness, we also provide a PAC-Bayes bound for stationary
$\varphi$-mixing processes; it is based on a different approach and its presentation is
postponed to the appendix together with the tools that allow us to derive it.



1.4 Organization of the Paper 

The paper is organized as follows. Section 2 recalls the standard IID PAC-Bayes bound.
Section 3 introduces the notion of fractional covers and states the new chromatic PAC-Bayes
bounds, which rely on the fractional chromatic number of the dependency graph of
the data at hand. Section 4 provides specific versions of our bounds for the case of IID data,
ranking and stationary $\beta$-mixing processes, giving rise to original generalization
bounds. A PAC-Bayes bound for stationary $\varphi$-mixing processes, based on arguments
different from those of the chromatic PAC-Bayes bound, is provided in the appendix.



2. IID PAC-Bayes Bound 

We introduce notation that will hold from here on. We mainly consider the problem of
binary classification over the input space $\mathcal{X}$ and we denote the set of possible
labels as $\mathcal{Y} = \{-1, +1\}$ (for the case of ranking described in Section 4, we use
$\mathcal{Y} = \mathbb{R}$); $\mathcal{Z}$ denotes the product space
$\mathcal{X} \times \mathcal{Y}$. $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$ is a family
of real-valued classifiers defined on $\mathcal{X}$: for $h \in \mathcal{H}$, the predicted
output of $x \in \mathcal{X}$ is given by $\mathrm{sign}(h(x))$, where $\mathrm{sign}(x) = +1$
if $x \ge 0$ and $-1$ otherwise. $D$ is a probability distribution defined over $\mathcal{Z}$
and $\mathbf{D}_m$ denotes the distribution of an $m$-sample; for instance,
$\mathbf{D}_m = \otimes_{i=1}^m D = D^m$ is the distribution of an IID sample
$\mathbf{Z} = \{Z_i\}_{i=1}^m$ of size $m$ ($Z_i \sim D$, $i = 1, \ldots, m$). $P$ and $Q$ are
distributions over $\mathcal{H}$. For any positive integer $m$, $[m]$ stands for
$\{1, \ldots, m\}$.

The IID PAC-Bayes bound can be stated as follows (McAllester, 2003; Seeger, 2002a;
Langford, 2005).



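Theorem 1 (IID PAC-Bayes Bound) $\forall D$, $\forall \mathcal{H}$, $\forall \delta \in (0,1]$,
$\forall P$, with probability at least $1 - \delta$ over the random draw of
$\mathbf{Z} \sim \mathbf{D}_m = D^m$, the following holds:

$$\forall Q, \quad \mathrm{kl}\big(\hat{e}_Q(\mathbf{Z}) \,\|\, e_Q\big) \le \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}\right]. \qquad (1)$$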



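This theorem provides a generalization error bound for the Gibbs classifier $g_Q$: given a
distribution $Q$, this stochastic classifier predicts a class for $x \in \mathcal{X}$ by first
drawing a hypothesis $h$ according to $Q$ and then outputting $\mathrm{sign}(h(x))$. Here,
$\hat{e}_Q$ is the empirical error of $g_Q$ on an IID sample $\mathbf{Z}$ of size $m$ and
$e_Q$ is its true error:

$$\begin{aligned}
\hat{e}_Q(\mathbf{Z}) &:= \mathbb{E}_{h \sim Q} \frac{1}{m}\sum_{i=1}^m r(h, Z_i) = \mathbb{E}_{h \sim Q} \hat{R}(h, \mathbf{Z}) \quad \text{with } \hat{R}(h, \mathbf{Z}) := \frac{1}{m}\sum_{i=1}^m r(h, Z_i), \\
e_Q &:= \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{e}_Q(\mathbf{Z}) = \mathbb{E}_{h \sim Q} R(h) \quad \text{with } R(h) := \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{R}(h, \mathbf{Z}),
\end{aligned} \qquad (2)$$

where, for $Z = (X, Y)$,

$$r(h, Z) := \mathbb{I}_{Y h(X) < 0}.$$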

Note that we will use this binary 0-1 risk function $r$ throughout the paper and that a
generalization of our results to bounded real-valued risk functions is given in the appendix.
Since $\mathbf{Z}$ is an (independently) identically distributed sample, we have

$$R(h) = \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{R}(h, \mathbf{Z}) = \mathbb{E}_{Z \sim D}\, r(h, Z). \qquad (3)$$



For $p, q \in [0, 1]$, $\mathrm{kl}(q\|p)$ is the Kullback-Leibler divergence between the
Bernoulli distributions with probabilities of success $q$ and $p$, and $\mathrm{KL}(Q\|P)$ is
the Kullback-Leibler divergence between $Q$ and $P$:

$$\mathrm{kl}(q\|p) := q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}, \qquad \mathrm{KL}(Q\|P) := \mathbb{E}_{h \sim Q} \ln\frac{Q(h)}{P(h)},$$

where $\mathrm{kl}(0\|0) = \mathrm{kl}(1\|1) = 0$. Throughout, we assume that the posteriors are
absolutely continuous with respect to their corresponding priors.

It is straightforward to see that the mapping $\mathrm{kl}_q : t \mapsto \mathrm{kl}(q\|q+t)$
is strictly increasing for $t \in [0, 1-q)$ and therefore defines a bijection from $[0, 1-q)$
to $\mathbb{R}_+$: we denote by $\mathrm{kl}_q^{-1}$ its inverse. Then, as pointed out by
Seeger (2002a), the function
$\mathrm{kl}^{-1} : (q, \varepsilon) \mapsto \mathrm{kl}^{-1}(q, \varepsilon) = \mathrm{kl}_q^{-1}(\varepsilon)$
is well-defined over $[0, 1) \times \mathbb{R}_+$, and, by definition:

$$t \ge \mathrm{kl}^{-1}(q, \varepsilon) \iff \mathrm{kl}(q\|q+t) \ge \varepsilon.$$

This makes it possible to rewrite bound (1) in a more 'usual' form:

$$\forall Q, \quad e_Q \le \hat{e}_Q(\mathbf{Z}) + \mathrm{kl}^{-1}\left(\hat{e}_Q(\mathbf{Z}),\ \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}\right]\right). \qquad (4)$$
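
The form (4) is easy to evaluate numerically because $\mathrm{kl}_q$ is strictly increasing, so its inverse can be obtained by bisection. The following is a minimal sketch, not part of the original derivation; the function names (`kl_bernoulli`, `kl_inverse`, `iid_pac_bayes_bound`) and the tolerance are illustrative assumptions.

```python
import math


def kl_bernoulli(q: float, p: float) -> float:
    """kl(q||p) between Bernoulli(q) and Bernoulli(p), with 0 ln 0 = 0."""
    def term(a: float, b: float) -> float:
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)

    return term(q, p) + term(1.0 - q, 1.0 - p)


def kl_inverse(q: float, eps: float, tol: float = 1e-12) -> float:
    """Approximate kl^{-1}(q, eps): the t in [0, 1-q) with kl(q||q+t) = eps.

    Bisection is valid because t -> kl(q||q+t) is strictly increasing.
    """
    lo, hi = 0.0, 1.0 - q
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, q + mid) >= eps:
            hi = mid
        else:
            lo = mid
    return hi


def iid_pac_bayes_bound(emp_err: float, kl_qp: float, m: int, delta: float) -> float:
    """Right-hand side of (4) for an empirical Gibbs error and a given KL(Q||P)."""
    eps = (kl_qp + math.log((m + 1) / delta)) / m
    return emp_err + kl_inverse(emp_err, eps)
```

For instance, `iid_pac_bayes_bound(0.1, 2.0, 1000, 0.05)` returns an upper bound on $e_Q$ for a hypothetical empirical error of 0.1, $\mathrm{KL}(Q\|P) = 2$, $m = 1000$ and $\delta = 0.05$; these input values are made up for illustration.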



We observe that even if bounds (1) and (4) apply to the risk $e_Q$ of the stochastic
classifier $g_Q$, a straightforward argument gives that, if $b_Q$ is the (deterministic) Bayes
classifier such that $b_Q(x) = \mathrm{sign}(\mathbb{E}_{h \sim Q} h(x))$, then
$R(b_Q) = \mathbb{E}_{Z \sim D}\, r(b_Q, Z) \le 2 e_Q$ (see for instance Herbrich and Graepel
(2001)). Langford and Shawe-Taylor (2002) show that under some margin assumption, $R(b_Q)$
can be bounded even more tightly.



3. Chromatic PAC-Bayes Bounds

The problem we focus on is that of generalizing Theorem 1 to the situation where there may
exist probabilistic dependencies between the elements $Z_i$ of $\mathbf{Z} = \{Z_i\}_{i=1}^m$
while the marginal distributions of the $Z_i$'s are identical. As announced before, we provide
PAC-Bayes bounds for classifiers trained on identically but not independently distributed
data. These results rely on properties of a dependency graph that is built according to the
dependencies within $\mathbf{Z}$. Before stating our new bounds, we thus introduce the
concepts of graph theory that will play a role in their statements.

3.1 Dependency Graph, Fractional Covers 

Definition 2 (Dependency Graph) Let $\mathbf{Z} = \{Z_i\}_{i=1}^m$ be a set of $m$ random
variables taking values in some space $\mathcal{Z}$. The dependency graph
$\Gamma(\mathbf{Z}) = (V, E)$ of $\mathbf{Z}$ is such that:

• the set of vertices $V$ of $\Gamma(\mathbf{Z})$ is $V = [m]$;

• $(i, j) \notin E$ (there is no edge between $i$ and $j$) $\iff$ $Z_i$ and $Z_j$ are independent.



Definition 3 (Fractional Covers, Schreinerman and Ullman (1997)) Let $\Gamma = (V, E)$
be an undirected graph, with $V = [m]$.

• $C \subseteq V$ is independent if the vertices in $C$ are independent (no two vertices in $C$
are connected).

• $\mathbf{C} = \{C_j\}_{j=1}^n$, with $C_j \subseteq V$, is a proper cover of $V$ if each $C_j$
is independent and $\bigcup_{j=1}^n C_j = V$. It is exact if $\mathbf{C}$ is a partition of $V$.
The size of $\mathbf{C}$ is $n$.

• $\mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n$, with $C_j \subseteq V$ and $\omega_j \in [0, 1]$, is
a proper exact fractional cover of $V$ if each $C_j$ is independent and
$\forall i \in V$, $\sum_{j=1}^n \omega_j \mathbb{I}_{i \in C_j} = 1$;
$\omega(\mathbf{C}) = \sum_{j=1}^n \omega_j$ is the chromatic weight of $\mathbf{C}$.

• The (fractional) chromatic number $\chi(\Gamma)$ ($\chi^*(\Gamma)$) is the minimum size
(chromatic weight) over all proper exact (fractional) covers of $\Gamma$.

A cover is a fractional cover such that all the weights are equal to 1 (and all the results
we state for fractional covers apply to the case of covers). If $n$ is the size of a cover, it
means that the nodes of the graph at hand can be colored with $n$ colors in such a way that no
two adjacent nodes receive the same color.

The problem of computing the (fractional) chromatic number of a graph is NP-hard
(Schreinerman and Ullman, 1997). However, for some particular graphs, such as those that come
from the settings we study in Section 4, this number can be evaluated precisely. If it cannot
be evaluated, it can be upper bounded using the following property.



Property 1 (Schreinerman and Ullman (1997)) Let $\Gamma = (V, E)$ be a graph. Let
$c(\Gamma)$ be the clique number of $\Gamma$, i.e. the order of the largest clique in $\Gamma$.
Let $\Delta(\Gamma)$ be the maximum degree of a vertex in $\Gamma$. We have the following
inequalities:

$$1 \le c(\Gamma) \le \chi^*(\Gamma) \le \chi(\Gamma) \le \Delta(\Gamma) + 1.$$

In addition, $1 = c(\Gamma) = \chi^*(\Gamma) = \chi(\Gamma) = \Delta(\Gamma) + 1$ if and only if
$\Gamma$ is totally disconnected.
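
The upper bound $\chi(\Gamma) \le \Delta(\Gamma) + 1$ of Property 1 is constructive: a greedy coloring never needs more than $\Delta(\Gamma) + 1$ colors, and its color classes form a proper exact cover. The sketch below illustrates this; it is not taken from the paper, vertices are numbered from 0 rather than from 1, and the function names are hypothetical.

```python
def greedy_proper_exact_cover(m, edges):
    """Greedy coloring of the graph on vertices 0..m-1.

    Returns a list of independent sets (color classes) that partition the
    vertex set, i.e. a proper exact cover of size at most max_degree + 1.
    """
    neighbors = {v: set() for v in range(m)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    color = {}
    for v in range(m):
        used = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in used:  # smallest color not used by an already-colored neighbor
            c += 1
        color[v] = c

    n_colors = max(color.values()) + 1 if color else 0
    cover = [set() for _ in range(n_colors)]
    for v, c in color.items():
        cover[c].add(v)
    return cover


def max_degree(m, edges):
    """Maximum degree of the graph on vertices 0..m-1."""
    degree = [0] * m
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return max(degree) if degree else 0
```

The size of the returned cover is only an upper bound on $\chi(\Gamma)$, hence on $\chi^*(\Gamma)$; it may be loose for graphs where a better coloring exists.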



If $\mathbf{Z} = \{Z_i\}_{i=1}^m$ is a set of random variables over $\mathcal{Z}$ then a
(fractional) proper cover of $\Gamma(\mathbf{Z})$ splits $\mathbf{Z}$ into subsets of
independent random variables. This is a crucial feature to establish our results. In addition,
we can see $\chi^*(\Gamma(\mathbf{Z}))$ and $\chi(\Gamma(\mathbf{Z}))$ as measures of the
amount of dependencies within $\mathbf{Z}$.

The following lemma (Lemma 3.1 in Janson (2004)) will be very useful in the sequel.



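Lemma 4 If $\mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n$ is an exact fractional cover of
$\Gamma = (V, E)$, with $V = [m]$, then

$$\forall t \in \mathbb{R}^m, \quad \sum_{i=1}^m t_i = \sum_{j=1}^n \omega_j \sum_{k \in C_j} t_k.$$

In particular, $m = \sum_{j=1}^n \omega_j |C_j|$.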
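
As a sanity check of Definition 3 and Lemma 4, consider the 5-cycle, whose fractional chromatic number is known to be 5/2. The sketch below builds a proper exact fractional cover of weight 5/2 and compares both sides of the identity of Lemma 4. It is an illustration only: vertices are numbered from 0, exact rational arithmetic is used to avoid rounding, and the helper names are assumptions.

```python
from fractions import Fraction


def is_proper_exact_fractional_cover(m, edges, cover):
    """Check Definition 3: every block is independent and, for each vertex,
    the weights of the blocks containing it sum to exactly 1."""
    edge_set = {frozenset(e) for e in edges}
    for block, _ in cover:
        members = sorted(block)
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                if frozenset((members[a], members[b])) in edge_set:
                    return False
    return all(
        sum(w for block, w in cover if i in block) == 1 for i in range(m)
    )


def chromatic_weight(cover):
    """omega(C): sum of the weights of the cover."""
    return sum(w for _, w in cover)


def lemma4_sides(t, cover):
    """Return (sum_i t_i, sum_j omega_j sum_{k in C_j} t_k)."""
    lhs = sum(t)
    rhs = sum(w * sum(t[k] for k in block) for block, w in cover)
    return lhs, rhs


# 5-cycle: vertex i is adjacent to i+1 (mod 5); {i, i+2} is an independent set.
m = 5
cycle_edges = [(i, (i + 1) % m) for i in range(m)]
c5_cover = [({i, (i + 2) % m}, Fraction(1, 2)) for i in range(m)]
```

Each vertex belongs to exactly two blocks of weight 1/2, so the cover is exact, and taking $t = (1, \ldots, 1)$ in `lemma4_sides` recovers $m = \sum_j \omega_j |C_j|$.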

3.2 Chromatic PAC-Bayes Bounds 

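We now provide new PAC-Bayes bounds for classifiers trained on samples $\mathbf{Z}$ drawn
from distributions $\mathbf{D}_m$ where dependencies exist. We assume these dependencies are
fully determined by $\mathbf{D}_m$ and we define the dependency graph $\Gamma(\mathbf{D}_m)$ of
$\mathbf{D}_m$ to be $\Gamma(\mathbf{D}_m) = \Gamma(\mathbf{Z})$. As said before, the marginal
distributions of $\mathbf{D}_m$ along each coordinate are the same and are equal to some
distribution $D$.

We introduce additional notation. PEFC($\mathbf{D}_m$) is the set of proper exact fractional
covers of $\Gamma(\mathbf{D}_m)$. Given a cover
$\mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n \in$ PEFC($\mathbf{D}_m$), we use the following
notation:

• $\mathbf{Z}^{(j)} = \{Z_k\}_{k \in C_j}$;

• $\mathbf{D}_m^{(j)}$, the distribution of $\mathbf{Z}^{(j)}$: it is equal to
$D^{|C_j|} = \otimes_{i=1}^{|C_j|} D$ ($C_j$ is independent);

• $\alpha = (\alpha_j)_{1 \le j \le n}$ with $\alpha_j = \omega_j / \omega(\mathbf{C})$: we have
$\alpha_j \ge 0$ and $\sum_j \alpha_j = 1$;

• $\pi = (\pi_j)_{1 \le j \le n}$ with $\pi_j = \omega_j |C_j| / m$: we have $\pi_j \ge 0$ and
$\sum_j \pi_j = 1$ (cf. Lemma 4).

In addition, $\mathbf{P}_n$ and $\mathbf{Q}_n$ denote distributions over $\mathcal{H}^n$;
$P_n^j$ and $Q_n^j$ are the marginal distributions of $\mathbf{P}_n$ and $\mathbf{Q}_n$ with
respect to the $j$th coordinate, respectively.
We can now state our main results.

Theorem 5 (Chromatic PAC-Bayes Bound (I)) $\forall \mathbf{D}_m$, $\forall \mathcal{H}$,
$\forall \delta \in (0, 1]$, $\forall \mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n \in$
PEFC($\mathbf{D}_m$), $\forall \mathbf{P}_n$, with probability at least $1 - \delta$ over the
random draw of $\mathbf{Z} \sim \mathbf{D}_m$, the following holds:

$$\forall \mathbf{Q}_n, \quad \mathrm{kl}\big(\hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z}) \,\|\, \bar{e}_{\mathbf{Q}_n}\big) \le \frac{\omega}{m}\left[\sum_{j=1}^n \alpha_j \mathrm{KL}(Q_n^j\|P_n^j) + \ln\frac{m + \omega}{\delta\omega}\right], \qquad (5)$$

where $\omega$ stands for $\omega(\mathbf{C})$, and

$$\hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z}) := \sum_{j=1}^n \pi_j \mathbb{E}_{h \sim Q_n^j} \hat{R}(h, \mathbf{Z}^{(j)}), \qquad \bar{e}_{\mathbf{Q}_n} := \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z}).$$

Proof Deferred to Section 3.4.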



We would like to emphasize that the same type of result, using the same proof techniques,
can be obtained if simple (i.e. not exact nor proper) fractional covers are considered.
However, as we shall see, the 'best' (in terms of tightness) bound is achieved for covers
from the set of proper exact fractional covers, and this is the reason why we have stated
Theorem 5 with a restriction to this particular set of covers.

The empirical quantity $\hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z})$ is a weighted average of the
empirical errors on $\mathbf{Z}^{(j)}$ of Gibbs classifiers with respective distributions
$Q_n^j$. The following proposition characterizes
$\bar{e}_{\mathbf{Q}_n} = \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z})$.

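Proposition 6 $\forall \mathbf{D}_m$, $\forall \mathcal{H}$,
$\forall \mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n \in$ PEFC($\mathbf{D}_m$),
$\forall \mathbf{Q}_n$:
$\bar{e}_{\mathbf{Q}_n} = \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z})$
is the error of the Gibbs classifier based on the mixture of distributions
$Q_\pi = \sum_{j=1}^n \pi_j Q_n^j$.

Proof From the definition of $\pi$, $\pi_j \ge 0$ and $\sum_{j=1}^n \pi_j = 1$. Thus,

$$\begin{aligned}
\mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z})
&= \mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \sum_j \pi_j \mathbb{E}_{h \sim Q_n^j} \hat{R}(h, \mathbf{Z}^{(j)}) \\
&= \sum_j \pi_j \mathbb{E}_{h \sim Q_n^j} \mathbb{E}_{\mathbf{Z}^{(j)} \sim \mathbf{D}_m^{(j)}} \hat{R}(h, \mathbf{Z}^{(j)}) && \text{(marginalization)} \\
&= \sum_j \pi_j \mathbb{E}_{h \sim Q_n^j} R(h) && \big(\mathbb{E}_{\mathbf{Z}^{(j)} \sim \mathbf{D}_m^{(j)}} \hat{R}(h, \mathbf{Z}^{(j)}) = R(h),\ \forall j\big) \\
&= \mathbb{E}_{h \sim \pi_1 Q_n^1 + \cdots + \pi_n Q_n^n} R(h) = \mathbb{E}_{h \sim Q_\pi} R(h),
\end{aligned}$$

where, in the third line, we have used the fact that the variables in $\mathbf{Z}^{(j)}$ are
identically distributed (by assumption, they are IID). ■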



Remark 7 The prior $\mathbf{P}_n$ and the posterior $\mathbf{Q}_n$ enter into play in
Proposition 6 and Theorem 5 through their marginals only. This advocates for the following
learning scheme. Given a cover and a (possibly factorized) prior $\mathbf{P}_n$, look for a
factorized posterior $\mathbf{Q}_n = \otimes_{j=1}^n Q_j$ such that each $Q_j$ independently
minimizes the usual IID PAC-Bayes bound given in Theorem 1 on each $\mathbf{Z}^{(j)}$. Then
make predictions according to the Gibbs classifier defined with respect to
$Q_\pi = \sum_j \pi_j Q_j$.

The following theorem gives a result that readily applies without choosing a specific
cover.

Theorem 8 (Chromatic PAC-Bayes Bound (II)) $\forall \mathbf{D}_m$, $\forall \mathcal{H}$,
$\forall \delta \in (0, 1]$, $\forall P$, with probability at least $1 - \delta$ over the random
draw of $\mathbf{Z} \sim \mathbf{D}_m$, the following holds:

$$\forall Q, \quad \mathrm{kl}\big(\hat{e}_Q(\mathbf{Z}) \,\|\, e_Q\big) \le \frac{\chi^*}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{m + \chi^*}{\delta\chi^*}\right], \qquad (6)$$

where $\chi^*$ is the fractional chromatic number of $\Gamma(\mathbf{D}_m)$, and where
$\hat{e}_Q(\mathbf{Z})$ and $e_Q$ are as in (2).

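Proof This theorem is just a particular case of Theorem 5. Assume that
$\mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n \in$ PEFC($\mathbf{D}_m$) is such that
$\omega(\mathbf{C}) = \chi^*(\Gamma(\mathbf{D}_m))$, $\mathbf{P}_n = \otimes_{j=1}^n P = P^n$ and
$\mathbf{Q}_n = \otimes_{j=1}^n Q = Q^n$, for some $P$ and $Q$.

For the right-hand side of (6), it directly comes that

$$\sum_j \alpha_j \mathrm{KL}(Q_n^j\|P_n^j) = \sum_j \alpha_j \mathrm{KL}(Q\|P) = \mathrm{KL}(Q\|P).$$

It then suffices to show that $\hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z}) = \hat{e}_Q(\mathbf{Z})$:

$$\begin{aligned}
\hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z})
&= \sum_j \pi_j \mathbb{E}_{h \sim Q_n^j} \hat{R}(h, \mathbf{Z}^{(j)}) = \sum_j \pi_j \mathbb{E}_{h \sim Q} \hat{R}(h, \mathbf{Z}^{(j)}) \\
&= \mathbb{E}_{h \sim Q} \frac{1}{m} \sum_j \omega_j \sum_{k \in C_j} r(h, Z_k) \\
&= \mathbb{E}_{h \sim Q} \frac{1}{m} \sum_{i=1}^m r(h, Z_i) && \text{(cf. Lemma 4)} \\
&= \mathbb{E}_{h \sim Q} \hat{R}(h, \mathbf{Z}) = \hat{e}_Q(\mathbf{Z}). \qquad ■
\end{aligned}$$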



A few comments are in order. 

• A $\chi^*$ worsening. This theorem says that even in the case of non-IID data, a PAC-Bayes
bound very similar to the IID PAC-Bayes bound can be stated, with a worsening
(since $\chi^* \ge 1$) proportional to $\chi^*$, i.e. proportional to the amount of
dependencies in the data. In addition, the new PAC-Bayes bound is valid with any priors and
posteriors, without the need for these distributions to depend on the chosen cover (as
is the case with the more general Theorem 5).

• $\chi^*$: the optimal constant. Among all elements of PEFC($\mathbf{D}_m$), $\chi^*$ is the
best constant achievable in terms of the tightness of the bound (6) on $e_Q$: getting an
optimal coloring gives rise to an 'optimal' bound. Indeed, it suffices to observe that the
right-hand side of (5) is increasing with respect to $\omega$ when all $Q_n^j$ are identical
(we let the reader check that). As $\chi^*$ is the smallest chromatic weight, it gives the
tightest bound.

(a) $\Gamma_{1\text{-edge}}$   (b) $\Gamma_u$

Figure 1: $\Gamma_u$ is the subgraph of $\Gamma_{1\text{-edge}}$ (which contains only one edge,
between $u$ and $v$) obtained when $u$ is removed: it might be preferable to consider the
distribution corresponding to $\Gamma_u$ in Theorem 8 instead of the distribution defined with
respect to $\Gamma_{1\text{-edge}}$, since $\chi^*(\Gamma_{1\text{-edge}}) = 2$ and
$\chi^*(\Gamma_u) = 1$ (see text for detailed comments).



• $\Gamma(\mathbf{D}_m)$ vs. induced subgraphs. If $\mathbf{s} \subseteq [m]$ and
$\mathbf{Z}_{\mathbf{s}} = \{Z_s : s \in \mathbf{s}\}$, it is obvious that Theorem 8 holds for
$|\mathbf{s}|$-samples drawn from the marginal distribution $\mathbf{D}_{\mathbf{s}}$ of
$\mathbf{Z}_{\mathbf{s}}$. Considering only $\mathbf{Z}_{\mathbf{s}}$ amounts to working with
the subgraph $\Gamma(\mathbf{D}_{\mathbf{s}})$ of $\Gamma(\mathbf{D}_m)$ induced by the vertices
in $\mathbf{s}$: this might provide a better bound in situations where
$\chi^*(\mathbf{D}_{\mathbf{s}})/|\mathbf{s}|$ is smaller than $\chi^*(\mathbf{D}_m)/m$ (this is
not guaranteed, however, because the empirical error $\hat{e}_Q(\mathbf{Z}_{\mathbf{s}})$
computed on $\mathbf{Z}_{\mathbf{s}}$ might be larger than $\hat{e}_Q(\mathbf{Z})$). To see
this, consider a graph $\Gamma_{1\text{-edge}} = (V, E)$ of $m$ vertices where $|E| = 1$, i.e.
there are only two nodes, say $u$ and $v$, that are connected (see Figure 1). The fractional
chromatic number $\chi^*_{1\text{-edge}}$ of $\Gamma_{1\text{-edge}}$ is 2 ($u$ and $v$ must use
distinct colors) while the (fractional) chromatic number $\chi^*_u$ of the subgraph $\Gamma_u$
of $\Gamma_{1\text{-edge}}$ obtained by removing $u$ is 1: $\chi^*_{1\text{-edge}}$ is twice as
big as $\chi^*_u$ while the numbers of nodes only differ by 1 and, for large $m$, this ratio
roughly carries over to $\chi^*_{1\text{-edge}}/m$ and $\chi^*_u/(m-1)$.

This last comment outlines that considering a subset of $\mathbf{Z}$, or, equivalently, a
subgraph of $\Gamma(\mathbf{D}_m)$, in (6), might provide a better generalization bound.
However, it is assumed that the choice of the subgraph is made before computing the bound: the
bound only holds with probability $1 - \delta$ for the chosen subgraph. To alleviate this and
provide a bound that takes advantage of several induced subgraphs, we have the following
proposition:

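Proposition 9 Let $\{m\}^{\#k}$ denote
$\{\mathbf{s} : \mathbf{s} \subseteq [m], |\mathbf{s}| = m - k\}$. $\forall \mathbf{D}_m$,
$\forall \mathcal{H}$, $\forall k \in [m]$, $\forall \delta \in (0, 1]$, $\forall P$, with
probability at least $1 - \delta$ over the random draw of $\mathbf{Z} \sim \mathbf{D}_m$:
$\forall Q$,

$$e_Q \le \min_{\mathbf{s} \in \{m\}^{\#k}} \left\{ \hat{e}_Q(\mathbf{Z}_{\mathbf{s}}) + \mathrm{kl}^{-1}\left(\hat{e}_Q(\mathbf{Z}_{\mathbf{s}}),\ \frac{\chi^*_{\mathbf{s}}}{|\mathbf{s}|}\left[\mathrm{KL}(Q\|P) + \ln\frac{|\mathbf{s}| + \chi^*_{\mathbf{s}}}{\chi^*_{\mathbf{s}}} + \ln\binom{m}{k} + \ln\frac{1}{\delta}\right]\right) \right\}, \qquad (7)$$

where $\chi^*_{\mathbf{s}}$ is the fractional chromatic number of
$\Gamma(\mathbf{D}_{\mathbf{s}})$, and where $\hat{e}_Q(\mathbf{Z}_{\mathbf{s}})$ is the
empirical error of the Gibbs classifier $g_Q$ on $\mathbf{Z}_{\mathbf{s}}$, that is:
$\hat{e}_Q(\mathbf{Z}_{\mathbf{s}}) = \mathbb{E}_{h \sim Q} \hat{R}(h, \mathbf{Z}_{\mathbf{s}})$.

Proof Simply apply the union bound to equation (6) of Theorem 8: for fixed $k$, there are
$\binom{m}{m-k} = \binom{m}{k}$ subgraphs and using $\delta / \binom{m}{k}$ makes the bound hold
with probability $1 - \delta$ for all possible $\binom{m}{k}$ subgraphs (simultaneously).
Making use of the form (4) gives the result. ■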

This bound is particularly useful when, for some small $k$, there exists a subset
$\mathbf{s} \in \{m\}^{\#k}$ such that the induced subgraph $\Gamma(\mathbf{D}_{\mathbf{s}})$,
which has $k$ fewer nodes than $\Gamma(\mathbf{D}_m)$, has a fractional chromatic number
$\chi^*_{\mathbf{s}}$ that is smaller than $\chi^*(\mathbf{D}_m)$ (as is the case with the graph
$\Gamma_{1\text{-edge}}$ of Figure 1, where $k = 1$). Obtaining a similar result that holds for
subgraphs associated with sets $\mathbf{s}$ of sizes larger than or equal to $m - k$ is possible
by replacing $\ln\binom{m}{k}$ with $\ln\sum_{\kappa=0}^{k}\binom{m}{\kappa}$ in the bound (in
that case, $k$ should be kept small enough with respect to $m$, e.g. $k = O_m(1)$, to ensure
that the resulting bound still goes down to zero when $m \to \infty$).
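
To make the trade-off discussed above concrete, the following sketch evaluates the right-hand side of (6) and the second argument of $\mathrm{kl}^{-1}$ in (7) for the one-edge graph of Figure 1. It is only an illustration: the function names and the numerical values ($\mathrm{KL}(Q\|P) = 5$, $m = 1000$, $\delta = 0.05$) are assumptions, not values from the paper.

```python
import math


def chromatic_rhs(kl_qp: float, m: int, chi_star: float, delta: float) -> float:
    """Right-hand side of (6): (chi*/m) [KL(Q||P) + ln((m + chi*)/(delta chi*))]."""
    return (chi_star / m) * (kl_qp + math.log((m + chi_star) / (delta * chi_star)))


def subgraph_union_rhs(kl_qp: float, m: int, k: int, chi_star_s: float, delta: float) -> float:
    """Second argument of kl^{-1} in (7) for a subset s of size m - k.

    The ln C(m, k) term is the union-bound price paid for choosing the
    subgraph after seeing the data.
    """
    size = m - k
    return (chi_star_s / size) * (
        kl_qp
        + math.log((size + chi_star_s) / chi_star_s)
        + math.log(math.comb(m, k))
        + math.log(1.0 / delta)
    )


# One-edge graph on m = 1000 nodes: chi* = 2 on the full graph,
# chi* = 1 on the subgraph obtained by removing one endpoint of the edge.
full_graph = chromatic_rhs(5.0, 1000, 2.0, 0.05)
fixed_subgraph = chromatic_rhs(5.0, 999, 1.0, 0.05)          # subgraph chosen in advance
any_subgraph = subgraph_union_rhs(5.0, 1000, 1, 1.0, 0.05)   # Proposition 9 with k = 1
```

With these made-up inputs, the three quantities can be compared directly: fixing the subgraph in advance gives the smallest value, while Proposition 9 pays an extra $\ln\binom{m}{k}$ term for selecting the subgraph after the fact.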



3.3 On the Relevance of Fractional Covers 



One may wonder whether using the fractional cover framework is the only way to establish
a result similar to the one provided by Theorem 5. Of course, this is not the case and
one may imagine other ways of deriving closely related results without mentioning the idea
of fractional cover/coloring. (For instance, one may manipulate subsets of independent
variables, assign weights to these subsets without referring to fractional covers, and arrive
at results that are comparable to ours.)

However, if we assume that singling out independent sets of variables is the cornerstone
of dealing with interdependent random variables, we find it enlightening to cast our approach
within the rich and well-studied fractional cover/coloring framework. On the one hand, our
objective of deriving tight bounds amounts to finding a decomposition of the set of random
variables at hand into few and large independent subsets; from the graph theory point
of view, this corresponds to a problem of graph coloring. Explicitly using the
fractional cover/coloring argument allows us to directly benefit from the wealth of related
results, such as Property 1 or, for instance, approaches for computing a cover or
approximating the fractional chromatic number (e.g., linear programming). On the other
hand, from a technical point of view, making use of the fractional cover argument allows
us to preserve the simple structure of the proof of the classical IID PAC-Bayes bound to
derive Theorem 5.

To summarize, the richness of the results on graph (fractional) coloring provides us with 
elegant tools to deal with a natural representation of the dependencies that may occur 
within a set of random variables. In addition, and as shown in this article, it is possible to
seamlessly take advantage of these tools in the PAC-Bayesian framework (and probably in 
other bound-related frameworks). 



3.4 Proof of Theorem 5



A proof in three steps, following the lines of the proofs given by Seeger (2002a) and Langford
(2005) for the IID PAC-Bayes bound, can be provided.



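Lemma 10 $\forall \mathbf{D}_m$, $\forall \delta \in (0, 1]$,
$\forall \mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n$, $\forall \mathbf{P}_n$ distribution over
$\mathcal{H}^n$, with probability at least $1 - \delta$ over the random draw of
$\mathbf{Z} \sim \mathbf{D}_m$, the following holds (here, $\omega$ stands for
$\omega(\mathbf{C})$):

$$\mathbb{E}_{\mathbf{h} \sim \mathbf{P}_n} \sum_{j=1}^n \alpha_j e^{|C_j| \mathrm{kl}(\hat{R}(h_j, \mathbf{Z}^{(j)}) \| R(h_j))} \le \frac{m + \omega}{\delta\omega}, \qquad (8)$$

where $\mathbf{h} = (h_1, \ldots, h_n)$ is a random vector of hypotheses.

Proof We first observe the following:

$$\begin{aligned}
\mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \sum_j \alpha_j e^{|C_j| \mathrm{kl}(\hat{R}(h_j, \mathbf{Z}^{(j)}) \| R(h_j))}
&= \sum_j \alpha_j \mathbb{E}_{\mathbf{Z}^{(j)} \sim \mathbf{D}_m^{(j)}} e^{|C_j| \mathrm{kl}(\hat{R}(h_j, \mathbf{Z}^{(j)}) \| R(h_j))} \\
&\le \sum_j \alpha_j (|C_j| + 1) && \text{(Lemma 20, Appendix)} \\
&= \frac{1}{\omega} \sum_j \omega_j (|C_j| + 1) \\
&= \frac{m + \omega}{\omega}, && \text{(Lemma 4)}
\end{aligned}$$

where using Lemma 20 is made possible by the fact that $\mathbf{Z}^{(j)}$ is an IID sample.
Therefore,

$$\mathbb{E}_{\mathbf{Z} \sim \mathbf{D}_m} \mathbb{E}_{\mathbf{h} \sim \mathbf{P}_n} \sum_{j=1}^n \alpha_j e^{|C_j| \mathrm{kl}(\hat{R}(h_j, \mathbf{Z}^{(j)}) \| R(h_j))} \le \frac{m + \omega}{\omega}.$$

According to Markov's inequality (Theorem 22, Appendix),

$$\mathbb{P}_{\mathbf{Z}}\left(\mathbb{E}_{\mathbf{h} \sim \mathbf{P}_n} \sum_{j=1}^n \alpha_j e^{|C_j| \mathrm{kl}(\hat{R}(h_j, \mathbf{Z}^{(j)}) \| R(h_j))} \ge \frac{m + \omega}{\delta\omega}\right) \le \delta. \qquad ■$$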



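Lemma 11 $\forall \mathbf{D}_m$, $\forall \mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n$,
$\forall \mathbf{P}_n$, $\forall \mathbf{Q}_n$, with probability at least $1 - \delta$ over the
random draw of $\mathbf{Z} \sim \mathbf{D}_m$, the following holds:

$$\frac{m}{\omega} \sum_{j=1}^n \pi_j \mathbb{E}_{h \sim Q_n^j} \mathrm{kl}\big(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h)\big) \le \sum_{j=1}^n \alpha_j \mathrm{KL}(Q_n^j\|P_n^j) + \ln\frac{m + \omega}{\delta\omega}. \qquad (9)$$

Proof It suffices to use Jensen's inequality (see the Appendix) with $\ln$ and the fact
that $\mathbb{E}_{X \sim P} f(X) = \mathbb{E}_{X \sim Q} \frac{P(X)}{Q(X)} f(X)$, for all
$f, P, Q$. Therefore, $\forall \mathbf{Q}_n$:

$$\begin{aligned}
\ln \mathbb{E}_{\mathbf{h} \sim \mathbf{P}_n} \sum_j \alpha_j e^{|C_j| \mathrm{kl}(\hat{R}(h_j, \mathbf{Z}^{(j)}) \| R(h_j))}
&= \ln \sum_j \alpha_j \mathbb{E}_{h \sim P_n^j} e^{|C_j| \mathrm{kl}(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h))} \\
&= \ln \sum_j \alpha_j \mathbb{E}_{h \sim Q_n^j} \frac{P_n^j(h)}{Q_n^j(h)} e^{|C_j| \mathrm{kl}(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h))} \\
&\ge \sum_j \alpha_j \mathbb{E}_{h \sim Q_n^j} \ln\left[\frac{P_n^j(h)}{Q_n^j(h)} e^{|C_j| \mathrm{kl}(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h))}\right] && \text{(Jensen's inequality)} \\
&= -\sum_j \alpha_j \mathrm{KL}(Q_n^j\|P_n^j) + \sum_j \alpha_j |C_j| \mathbb{E}_{h \sim Q_n^j} \mathrm{kl}\big(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h)\big) \\
&= -\sum_j \alpha_j \mathrm{KL}(Q_n^j\|P_n^j) + \frac{m}{\omega} \sum_j \pi_j \mathbb{E}_{h \sim Q_n^j} \mathrm{kl}\big(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h)\big),
\end{aligned}$$

where the last line uses $\alpha_j |C_j| = \omega_j |C_j| / \omega = (m/\omega)\pi_j$.
Lemma 10 then gives the result. ■

(a) IID data   (b) Bipartite ranking data

Figure 2: Dependency graphs for different settings described in Section 4. Nodes of the same
color are part of the same cover element; hence, they are probabilistically independent. (a)
When the data are IID, the dependency graph is disconnected and the fractional chromatic
number is $\chi^* = 1$; (b) a dependency graph obtained for bipartite ranking from a sample of
4 positive and 2 negative instances: $\chi^* = 4$.

Lemma 12 $\forall \mathbf{D}_m$, $\forall \mathbf{C} = \{(C_j, \omega_j)\}_{j=1}^n$,
$\forall \mathbf{Q}_n$, the following holds:

$$\frac{m}{\omega} \sum_{j=1}^n \pi_j \mathbb{E}_{h \sim Q_n^j} \mathrm{kl}\big(\hat{R}(h, \mathbf{Z}^{(j)}) \| R(h)\big) \ge \frac{m}{\omega}\, \mathrm{kl}\big(\hat{\bar{e}}_{\mathbf{Q}_n}(\mathbf{Z}) \,\|\, \bar{e}_{\mathbf{Q}_n}\big).$$

Proof This simply comes from the convexity of $\mathrm{kl}(x, y)$ in $(x, y)$ (Lemma 23,
Appendix). This, in combination with Lemma 11, closes the proof of Theorem 5. ■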



4. Applications 

In this section, we provide instances of Theorem 8 for various settings; amazingly, they
allow us to easily derive PAC-Bayes generalization bounds for problems such as ranking
and learning from stationary $\beta$-mixing processes. The theorems we provide here are all new
PAC-Bayes bounds for different non-IID settings.

4.1 IID Case 

The first case we are interested in is the IID setting. In this case, the training sample
$\mathbf{Z} = \{(X_i, Y_i)\}_{i=1}^m$ is distributed according to $\mathbf{D}_m = D^m$ and the
fractional chromatic number of $\Gamma(\mathbf{D}_m)$ is $\chi^* = 1$, since the dependency
graph, depicted in Figure 2a, is totally disconnected (see Property 1). Plugging this value of
$\chi^*$ into the bound of Theorem 8 gives the IID PAC-Bayes bound of Theorem 1. This
emphasizes the fact that the standard PAC-Bayes bound is a special case of our more general
results.

4.2 General Ranking and Connection to U-Statistics 

Here, the learning problem of interest is the following. $D$ is a distribution over
$\mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \mathbb{R}$ and one looks for a ranking
rule $h \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ that minimizes the ranking risk
$R^{\mathrm{rank}}(h)$ defined as:

$$R^{\mathrm{rank}}(h) := \mathbb{P}_{(X, Y) \sim D,\ (X', Y') \sim D}\big((Y - Y')\, h(X, X') < 0\big). \qquad (10)$$

For a random pair $(X, Y)$, $Y$ can be thought of as a score that allows one to rank objects:
given two pairs $(X, Y)$ and $(X', Y')$, $X$ has a higher rank (or is 'better') than $X'$ if
$Y > Y'$. The ranking rule $h$ predicts $X$ to be better than $X'$ if
$\mathrm{sign}(h(X, X')) = 1$ and conversely. The objective of learning is to produce a rule
$h$ that makes as few misrankings as possible, as measured by (10). Given a finite IID
(according to $D$) sample $S = \{(X_i, Y_i)\}_{i=1}^{\ell}$, an unbiased estimate of
$R^{\mathrm{rank}}(h)$ is $\hat{R}^{\mathrm{rank}}(h, S)$, with:

$$\hat{R}^{\mathrm{rank}}(h, S) := \frac{1}{\ell(\ell - 1)} \sum_{i \ne j} \mathbb{I}_{(Y_i - Y_j) h(X_i, X_j) < 0} = \frac{1}{\ell(\ell - 1)} \sum_{i \ne j} \mathbb{I}_{Y_{ij} h(X_i, X_j) < 0}, \qquad (11)$$

where $Y_{ij} := (Y_i - Y_j)$. A natural question is to bound the ranking risk for any learning
rule $h$ given $S$, where the difficulty is that (11) is a sum of identically but not
independently distributed random variables, namely the variables
$\mathbb{I}_{Y_{ij} h(X_i, X_j) < 0}$.
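
For illustration, the estimate (11) can be computed by looping over all ordered pairs $i \ne j$. This is a minimal sketch; the function name and the representation of the ranking rule as a two-argument Python callable are assumptions made for the example.

```python
def empirical_ranking_risk(h, xs, ys):
    """Empirical ranking risk (11): fraction of ordered pairs (i, j), i != j,
    for which (Y_i - Y_j) * h(X_i, X_j) < 0.

    `h` is any callable taking two inputs and returning a real number.
    Pairs with Y_i == Y_j never count as errors, since the product is 0.
    """
    ell = len(xs)
    if ell < 2 or len(ys) != ell:
        raise ValueError("need at least two examples and as many scores as inputs")
    errors = 0
    for i in range(ell):
        for j in range(ell):
            if i != j and (ys[i] - ys[j]) * h(xs[i], xs[j]) < 0:
                errors += 1
    return errors / (ell * (ell - 1))
```

The $\ell(\ell - 1)$ indicator variables summed here share sample points, which is exactly why they are not independent.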



Let us define $X_{ij} := (X_i, X_j)$, $Z_{ij} := (X_{ij}, Y_{ij})$, and
$\mathbf{Z} := \{Z_{ij}\}_{i \ne j}$. We note that the number $\ell$ of training data suffices
to determine the structure of the dependency graph $\Gamma_{\mathrm{rank}}$ of $\mathbf{Z}$ and
its distribution, which we denote $\mathbf{D}_{\ell(\ell-1)}$. Henceforth, we are clearly in the
framework for the application of the chromatic PAC-Bayes bounds defined in the previous
section. In particular, to instantiate Theorem 8 to the present ranking problem, we simply
need to have at hand the value $\chi^*_{\mathrm{rank}}$, or an upper bound thereof, of the
fractional chromatic number of $\Gamma_{\mathrm{rank}}$. We claim that
$\chi^*_{\mathrm{rank}} \le \ell(\ell - 1)/\lfloor \ell/2 \rfloor$, where $\lfloor x \rfloor$
is the largest integer less than or equal to $x$. We provide the following new PAC-Bayes bound
for the ranking risk:

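Theorem 13 (Ranking PAC-Bayes bound) $\forall D$ over $\mathcal{X} \times \mathcal{Y}$,
$\forall \mathcal{H} \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$,
$\forall \delta \in (0, 1]$, $\forall P$, with probability at least $1 - \delta$ over the random
draw of $S \sim D^{\ell}$, the following holds:

$$\forall Q, \quad \mathrm{kl}\big(\hat{e}_Q^{\mathrm{rank}}(S) \,\|\, e_Q^{\mathrm{rank}}\big) \le \frac{1}{\lfloor \ell/2 \rfloor}\left[\mathrm{KL}(Q\|P) + \ln\frac{\lfloor \ell/2 \rfloor + 1}{\delta}\right], \qquad (12)$$

where

$$\hat{e}_Q^{\mathrm{rank}}(S) := \mathbb{E}_{h \sim Q} \hat{R}^{\mathrm{rank}}(h, S), \qquad e_Q^{\mathrm{rank}} := \mathbb{E}_{S \sim D^{\ell}}\, \hat{e}_Q^{\mathrm{rank}}(S).$$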

Proof We essentially need to prove our claim on the bound on xla.LL^ 

a fractional cover of Fj-ank motivated by the theory of U-statistics ( Hoeffdind . 1 194 i[i963). 
R{h, S) is indeed a U-statistics of order 2 and it might be rewritten as a sum of IID blocks 
as follows 



R{h,S) 



1 1 1 ^'/'J 



iM[£/2i+i)) , 



where is the set of permutations over [i]. The innermost sum is obviously a sum of IID 
random variables as no two summands share the same indices. 
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A proper exact fractional cover Crank can be derived from this decomposition a^ 



C 



rank 



(c^ :- {Z,(.).(L£/2j+.)}Jf'^ :- JlZr^f^)} ^^^^ 



Indeed, as remarked before, each is an independent set and each random variable Zpq for 
p q, appears in exactly {£ — 2)! x [i/2\ sets (for i fixed, the number of permutations a 
such that a{i) = p and a{ [^/2j +i) = q is equal to {i — 2)!, i.e. the number of permutations 
on £ — 2 elements; as i can take [i/2\ values, this gives the result). Therefore, \/p,q,p ^ q: 

E ^'^H.^^c. = U-2)\\l/2\ ^ ^^-^^^ = (^-2)! 1^/21 ^ " ^^'L^/^J = ^' 
which proves that Crank is a proper exact fractional cover. Its weight a;(Crank) is 

C^(C,ank)=^!x^. = ^^^^. 

Hence, from the definition of Xrank' 

l{£-l)) 



/Crank — 



L£/2J 



The theorem follows by an instantiation of Theorem [8] with m := £{i — 1) and the bound 
on Xrank have just proven. ■ 
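The counting argument of the proof can be checked by brute force for small $\ell$. The sketch below enumerates all permutations, builds the blocks $C_{\sigma}$ with weight $1/((\ell-2)!\lfloor \ell/2\rfloor)$, and verifies independence, exactness and the total weight. The function names are assumptions of this example, indices are 0-based, and two pair variables are treated as dependent exactly when they share an index.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def rank_cover(l):
    """Blocks C_sigma and weights omega_sigma of the cover used in the proof."""
    half = l // 2
    weight = Fraction(1, factorial(l - 2) * half)
    cover = []
    for sigma in permutations(range(l)):
        block = frozenset((sigma[i], sigma[half + i]) for i in range(half))
        cover.append((block, weight))
    return cover


def check_rank_cover(l):
    """Verify independence and exactness; return the total weight of the cover."""
    cover = rank_cover(l)
    # Independence: inside a block, no index is used by two different pairs.
    for block, _ in cover:
        used = [k for pair in block for k in pair]
        assert len(used) == len(set(used))
    # Exactness: the weights of the blocks containing Z_pq sum to 1.
    for p in range(l):
        for q in range(l):
            if p != q:
                assert sum(w for block, w in cover if (p, q) in block) == 1
    return sum(w for _, w in cover)


total_weight = check_rank_cover(4)
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are checked without rounding. The enumeration grows as $\ell!$, so this is only meant as a sanity check on tiny instances.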

To our knowledge, this is the first PAC-Bayes bound on the ranking risk, while a Rademacher-complexity based analysis was given by Clémençon et al. (2008). In the proof, we have used arguments from the analysis of U-processes, which allow us to easily derive a convenient fractional cover of the dependency graph of $\mathbf{Z}$. Note however that our framework still applies even if not all the $Z_{ij}$'s are known, as required if an analysis based on U-processes is undertaken. This is particularly handy in practical situations where one may only be given the values $Y_{ij}$ (but not the values of $Y_i$ and $Y_j$) for a limited number of pairs (and not all the pairs).

An interesting question is to know how the so-called Hoeffding decomposition used by Clémençon et al. (2008) to establish fast rates of convergence for empirical ranking risk minimizers could be used to draw possibly tighter PAC-Bayes bounds. This would imply being able to appropriately take advantage of moments of order 2 in PAC-Bayes bounds, and a possible direction for that has been proposed by Lacasse et al. (2006). This is left for future work as it is not central to the present paper.

Of course, the ranking rule may be based on a scoring function $f\in\mathbb{R}^{\mathcal{X}}$ such that $h(X,X') = f(X) - f(X')$, in which case all the results that we state in terms of $h$ can be stated similarly in terms of $f$. This is important to note from a practical point of view as it is probably more usual to learn functions defined over $\mathcal{X}$ rather than $\mathcal{X}\times\mathcal{X}$ (as is $h$).

Finally, we would like to stress that the bound on $\chi^*_{\mathrm{rank}}$ that we have exhibited is actually rather tight. Indeed, it is straightforward to see that the clique number of $\Gamma_{\mathrm{rank}}$ is $2(\ell-1)$ (the cliques are made of variables $\{Z_{ip}\}_p\cup\{Z_{pi}\}_p$ for every $i$), and, since the clique number of a graph lower-bounds its fractional chromatic number, $2(\ell-1)$ is therefore a lower bound on $\chi^*_{\mathrm{rank}}$. If $\ell$ is even, then our bound on $\chi^*_{\mathrm{rank}}$ is equal to $2(\ell-1)$ and so is $\chi^*_{\mathrm{rank}}$; if $\ell$ is odd, then our bound is $2\ell$.

1. Note that the cover defined here considers elements $C_{\sigma}$ containing random variables themselves instead of their indices. This abuse of notation is made for the sake of readability.

4.3 Bipartite Ranking and a Bound on the Auc 

A particular ranking setting is that of bipartite ranking, where $\mathcal{Y} = \{-1,+1\}$. Let $D$ be a distribution over $\mathcal{X}\times\mathcal{Y}$ and $D_{+1}$ ($D_{-1}$) be the class conditional distribution $D_{X|Y=+1}$ ($D_{X|Y=-1}$) with respect to $D$. In this setting (see, e.g., Agarwal et al. (2005)), one may be interested in controlling what we call the bipartite misranking risk $R^{\mathrm{Auc}}(h)$ (the reason for the Auc superscript will become clear in the sequel) of a ranking rule $h\in\mathbb{R}^{\mathcal{X}\times\mathcal{X}}$, defined by

$$R^{\mathrm{Auc}}(h) := \mathbb{P}_{X\sim D_{+1},\,X'\sim D_{-1}}\big(h(X,X')<0\big). \tag{13}$$

Note that the relation between $R^{\mathrm{Auc}}$ and $R^{\mathrm{rank}}$ (cf. Equation (10)) can be made clear whenever the hypotheses $h$ under consideration are such that $h(x,x')$ and $h(x',x)$ have opposite signs. In this situation, it is straightforward to see that

$$R^{\mathrm{rank}}(h) = 2\,\mathbb{P}(Y=+1)\,\mathbb{P}(Y=-1)\,R^{\mathrm{Auc}}(h).$$

Let $S = \{(X_i,Y_i)\}_{i=1}^{\ell}$ be an IID sample distributed according to $\mathbf{D}_{\ell} = D^{\ell}$. The empirical bipartite ranking risk $\hat{R}^{\mathrm{Auc}}(h,S)$ of $h$ on $S$, defined as

$$\hat{R}^{\mathrm{Auc}}(h,S) := \frac{1}{\ell^{+}\ell^{-}}\sum_{\substack{i:Y_i=+1\\ j:Y_j=-1}}\mathbb{I}_{h(X_i,X_j)<0}, \tag{14}$$

where $\ell^{+}$ ($\ell^{-}$) is the number of positive (negative) data in $S$, estimates the fraction of pairs $(X_i,X_j)$ that are incorrectly ranked (given that $Y_i=+1$ and $Y_j=-1$) by $h$: it is an unbiased estimator of $R^{\mathrm{Auc}}(h)$.

As before, $h$ may be expressed in terms of a scoring function $f\in\mathbb{R}^{\mathcal{X}}$ such that $h(X,X') = f(X) - f(X')$, in which case (overloading notation):

$$R^{\mathrm{Auc}}(f) = \mathbb{P}_{X\sim D_{+1},\,X'\sim D_{-1}}\big(f(X)<f(X')\big)\quad\text{and}\quad \hat{R}^{\mathrm{Auc}}(f,S) = \frac{1}{\ell^{+}\ell^{-}}\sum_{\substack{i:Y_i=+1\\ j:Y_j=-1}}\mathbb{I}_{f(X_i)<f(X_j)},$$

where we recognize in $\hat{R}^{\mathrm{Auc}}(f,S)$ one minus the Area under the Roc curve, or Auc, of $f$ on $S$ (Agarwal et al., 2005; Cortes and Mohri, 2004), hence the Auc superscript in the name of the risk. As a consequence, providing a PAC-Bayes bound on $R^{\mathrm{Auc}}(h)$ (or $R^{\mathrm{Auc}}(f)$) amounts to providing a generalization (lower) bound on the Auc, which is a widely used measure in practice to evaluate the performance of a scoring function.
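The empirical bipartite risk written in terms of a scoring function can be sketched as follows. The function name `empirical_auc_risk` and the toy data are assumptions of this example. Ties $f(X_i)=f(X_j)$ are not counted as errors here, matching the strict inequality in the text, so the identification with one minus the Auc is exact only when there are no ties between positive and negative scores.

```python
def empirical_auc_risk(f, xs, ys):
    """Fraction of (positive, negative) pairs with f(x_pos) < f(x_neg)."""
    pos = [x for x, y in zip(xs, ys) if y == +1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    errors = sum(1 for xp in pos for xn in neg if f(xp) < f(xn))
    return errors / (len(pos) * len(neg))


# Hypothetical scoring function: the identity on real-valued inputs.
def score(x):
    return x


auc_risk = empirical_auc_risk(score, [0.9, 0.2, 0.6, 0.4], [+1, +1, -1, -1])
empirical_auc = 1.0 - auc_risk  # Auc of the scorer on this sample (no ties here)
```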

Let us define $X_{ij} := (X_i,X_j)$, $Z_{ij} := (X_{ij},1)$ and $\mathbf{Z} := \{Z_{ij}\}_{i,j:Y_i=+1,Y_j=-1}$, i.e. $\mathbf{Z}$ is a sequence of pairs $X_{ij}$ made of one positive example and one negative example. We then are once again in the framework defined earlier (see footnote 2), i.e., the $Z_{ij}$'s share the same distribution but are dependent on each other, since $Z_{ij}$ depends on $\{Z_{pq} : p = i \text{ or } q = j\}$ (see Figure 2). Note that in order to ease the reading of the present subsection, we make the implicit decomposition of the training set $S$ into $S = S^{+}\cup S^{-}$, where $S^{+}$ (resp. $S^{-}$) is made of the $\ell^{+}$ ($\ell^{-}$) positive (negative) data of $S$; the size $\ell$ of $S$ is therefore $\ell = \ell^{+}+\ell^{-}$. This decomposition entails a separate reindexing of the positive (negative) data from 1 to $\ell^{+}$ (from 1 to $\ell^{-}$).

2. The slight difference with what has been described above is that the dependency graph is now a random variable: it depends on the $Y_i$'s. It is shown in the proof of Theorem 14 how this can be dealt with.

Building on Theorem 8, we have the following result:

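Theorem 14 (Auc PAC-Bayes bound) $\forall D$ over $\mathcal{X}\times\mathcal{Y}$, $\forall\mathcal{H}\subseteq\mathbb{R}^{\mathcal{X}\times\mathcal{X}}$, $\forall\delta\in(0,1]$, $\forall P$, with probability at least $1-\delta$ over the random draw of $S\sim D^{\ell}$, the following holds

$$\forall Q,\quad \mathrm{kl}\big(\hat{e}^{\mathrm{Auc}}_Q(S)\,\|\,e^{\mathrm{Auc}}_Q\big) \le \frac{1}{\ell_{\min}}\left[\mathrm{KL}(Q\|P) + \ln\frac{\ell_{\min}+1}{\delta}\right], \tag{15}$$

where $\ell_{\min} = \min(\ell^{+},\ell^{-})$, and

$$\hat{e}^{\mathrm{Auc}}_Q(S) := \mathbb{E}_{h\sim Q}\hat{R}^{\mathrm{Auc}}(h,S),\qquad e^{\mathrm{Auc}}_Q := \mathbb{E}_{S\sim D^{\ell}}\hat{e}^{\mathrm{Auc}}_Q(S).$$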



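Proof The proof works in three steps and borrows ideas from Agarwal et al. (2005). The first two parts are necessary to deal with the fact that the dependency graph of $\mathbf{Z}$, as it depends on the random sample $S$, does not have a deterministic structure.

Conditioning on $\mathbf{Y} = \mathbf{y}$. Let $\mathbf{y}\in\{-1,+1\}^{\ell}$ be a fixed vector and let $\ell^{+}_{\mathbf{y}}$ and $\ell^{-}_{\mathbf{y}}$ be the number of positive and negative labels in $\mathbf{y}$, respectively. We define the distribution $D_{\mathbf{y}}$ as $D_{\mathbf{y}} := \otimes_{i=1}^{\ell} D_{y_i}$; this is a distribution on $\mathcal{X}^{\ell}$. With a slight abuse of notation, $D_{\mathbf{y}}$ will also be used to denote the distribution over $(\mathcal{X}\times\mathcal{Y})^{\ell}$ of samples $S = \{(X_i,y_i)\}_{i=1}^{\ell}$ such that the sequence $\{X_i\}_{i=1}^{\ell}$ is distributed according to $D_{\mathbf{y}}$. It is easy to check that $\forall h\in\mathcal{H}$, $\mathbb{E}_{S\sim D_{\mathbf{y}}}\hat{R}^{\mathrm{Auc}}(h,S) = R^{\mathrm{Auc}}(h)$ (cf. equations (13) and (14)).

Given $S$, if we define, as said earlier, $X_{ij} := (X_i,X_j)$, $Y_{ij} := 1$ and $Z_{ij} := (X_{ij},Y_{ij})$, then $\mathbf{Z} := \{Z_{ij}\}_{i:y_i=+1,\,j:y_j=-1}$ is a sample of identically distributed variables, each with distribution $D_{\pm 1} = D_{+1}\otimes D_{-1}\otimes\mathbf{1}$ over $\mathcal{X}\times\mathcal{X}\times\mathcal{Y}$, where $\mathcal{Y} = \{-1,+1\}$ and where $\mathbf{1}$ is the distribution that produces 1 with probability 1.

Letting $m = \ell^{+}_{\mathbf{y}}\ell^{-}_{\mathbf{y}}$, we denote by $\mathbf{D}_{\mathbf{y},m}$ the distribution of the training sample $\mathbf{Z}$, within which interdependencies exist, as illustrated in Figure 2. Theorem 8 can thus be directly applied to classifiers trained on $\mathbf{Z}$, the structure of $\Gamma(\mathbf{D}_{\mathbf{y},m})$ and its corresponding fractional chromatic number $\chi^*_{\mathbf{y}}$ being completely determined by $\mathbf{y}$. Hence, letting $\mathcal{H}\subseteq\mathbb{R}^{\mathcal{X}\times\mathcal{X}}$, we have: $\forall\delta\in(0,1]$, $\forall P$ over $\mathcal{H}$, with probability at least $1-\delta$ over the random draw of $\mathbf{Z}\sim\mathbf{D}_{\mathbf{y},m}$,

$$\forall Q,\quad \mathrm{kl}\big(\hat{e}_Q(\mathbf{Z})\,\|\,e_Q\big) \le \frac{\chi^*_{\mathbf{y}}}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{m+\chi^*_{\mathbf{y}}}{\delta\chi^*_{\mathbf{y}}}\right],$$

where $\hat{e}_Q(\mathbf{Z}) = \mathbb{E}_{h\sim Q}\hat{R}(h,\mathbf{Z}) = \mathbb{E}_{h\sim Q}\frac{1}{m}\sum_{ij}\mathbb{I}_{Y_{ij}h(X_{ij})<0} = \mathbb{E}_{h\sim Q}\frac{1}{m}\sum_{ij}\mathbb{I}_{h(X_{ij})<0}$, which is exactly equal to $\hat{e}^{\mathrm{Auc}}_Q(S)$ (cf. (14)); likewise, $e_Q = \mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_{\mathbf{y},m}}\hat{e}_Q(\mathbf{Z}) = \mathbb{E}_{S\sim D_{\mathbf{y}}}\hat{e}^{\mathrm{Auc}}_Q(S) = e^{\mathrm{Auc}}_Q$. Hence, $\forall\delta\in(0,1]$, $\forall P$, with probability at least $1-\delta$ over the random draw of $S\sim D_{\mathbf{y}}$,

$$\forall Q,\quad \mathrm{kl}\big(\hat{e}^{\mathrm{Auc}}_Q(S)\,\|\,e^{\mathrm{Auc}}_Q\big) \le \frac{\chi^*_{\mathbf{y}}}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{m+\chi^*_{\mathbf{y}}}{\delta\chi^*_{\mathbf{y}}}\right]. \tag{16}$$

Unconditioning on $\mathbf{Y}$. As proposed by Agarwal et al. (2005), let us call $\Phi(P,S,\delta)$ the event (16); we just stated that $\forall\mathbf{y}\in\{-1,+1\}^{\ell}$, $\forall P$, $\forall\delta\in(0,1]$, $\mathbb{P}_{S\sim D_{\mathbf{y}}}(\Phi(P,S,\delta))\ge 1-\delta$, or, equivalently,

$$\mathbb{P}_{S\sim\mathbf{D}_{\ell}}\big(\neg\Phi(P,S,\delta)\,\big|\,\mathbf{Y}=\mathbf{y}\big) = \mathbb{P}_{S\sim D_{\mathbf{y}}}\big(\neg\Phi(P,S,\delta)\big) \le \delta,$$

i.e., the conditional (to $\mathbf{Y}=\mathbf{y}$) probability of the event $\neg\Phi(P,S,\delta)$ is bounded by $\delta$. Since this holds for every $\mathbf{y}$, averaging over $\mathbf{Y}$ directly implies that the unconditional probability of $\neg\Phi(P,S,\delta)$ is bounded by $\delta$ as well:

$$\mathbb{P}_{S\sim\mathbf{D}_{\ell}}\big(\neg\Phi(P,S,\delta)\big) = \sum_{\mathbf{y}}\mathbb{P}(\mathbf{Y}=\mathbf{y})\,\mathbb{P}_{S\sim\mathbf{D}_{\ell}}\big(\neg\Phi(P,S,\delta)\,\big|\,\mathbf{Y}=\mathbf{y}\big) \le \delta.$$

Hence, $\forall\delta\in(0,1]$, $\forall P$, with probability at least $1-\delta$ over the random draw of $S\sim\mathbf{D}_{\ell}$,

$$\forall Q,\quad \mathrm{kl}\big(\hat{e}^{\mathrm{Auc}}_Q(S)\,\|\,e^{\mathrm{Auc}}_Q\big) \le \frac{\chi^*_{S}}{m_S}\left[\mathrm{KL}(Q\|P) + \ln\frac{m_S+\chi^*_{S}}{\delta\chi^*_{S}}\right], \tag{17}$$

where $\chi^*_S$ is the fractional chromatic number of the graph $\Gamma(\mathbf{Z})$, with $\mathbf{Z}$ defined from $S$ as in the first part of the proof, where the observed (random) labels are now taken into account; here $m_S = \ell^{+}\ell^{-}$, where $\ell^{+}$ ($\ell^{-}$) is the number of positive (negative) data in $S$.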

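Computing the Fractional Chromatic Number. In order to finish the proof, it suffices to observe that, for $\mathbf{Z} = \{Z_{ij}\}_{ij}$, if $\ell_{\max} = \max(\ell^{+},\ell^{-})$, then the fractional chromatic number of $\Gamma(\mathbf{Z})$ is $\chi^* = \ell_{\max}$.

Indeed, the clique number of $\Gamma(\mathbf{Z})$ is $\ell_{\max}$ as for all $i = 1,\ldots,\ell^{+}$ ($j = 1,\ldots,\ell^{-}$), $\{Z_{ij} : j = 1,\ldots,\ell^{-}\}$ ($\{Z_{ij} : i = 1,\ldots,\ell^{+}\}$) defines a clique of order $\ell^{-}$ ($\ell^{+}$) in $\Gamma(\mathbf{Z})$. Thus, since the clique number lower-bounds the fractional chromatic number, which is itself at most the chromatic number $\chi$, we have $\chi\ge\chi^*\ge\ell_{\max}$.

A proper exact cover $\mathbf{C} = \{C_k\}_{k=1}^{\ell_{\max}}$ of $\Gamma(\mathbf{Z})$ can be constructed as follows. Suppose that $\ell_{\max} = \ell^{+}$ (the other case is symmetric); then $C_k = \{Z_{\sigma_k(j)j} : j = 1,\ldots,\ell^{-}\}$, with

$$\sigma_k(j) = (j + k - 2 \bmod \ell^{+}) + 1,$$

is an independent set: no two variables $Z_{ij}$ and $Z_{pq}$ in $C_k$ are such that $i = p$ or $j = q$. In addition, it is straightforward to check that $\mathbf{C}$ is indeed a cover of $\Gamma(\mathbf{Z})$. This cover is of size $\ell^{+} = \ell_{\max}$, which means that it achieves the minimal possible weight over proper exact (fractional) covers since $\chi^*\ge\ell_{\max}$. Hence, $\chi^* = \chi = \ell_{\max}\ (= c(\Gamma))$. Plugging in this value of $\chi^*$ in (17), and noting that $m_S = \ell_{\max}\ell_{\min}$ with $\ell_{\min} = \min(\ell^{+},\ell^{-})$, closes the proof. ■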
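The cyclic cover of the last step can be checked mechanically. The sketch below uses the following conventions, which are assumptions of this example: a pair variable is identified with its index pair $(i,j)$, where $i\in\{1,\ldots,\ell^{+}\}$ indexes a positive example and $j\in\{1,\ldots,\ell^{-}\}$ a negative one, and $\ell^{+}\ge\ell^{-}$ (swap the roles otherwise). The function names are hypothetical.

```python
def bipartite_cover(l_pos, l_neg):
    """Blocks C_1..C_{l_pos}; C_k pairs negative j with positive
    ((j + k - 2) mod l_pos) + 1. Assumes l_pos >= l_neg >= 1."""
    if not l_pos >= l_neg >= 1:
        raise ValueError("this sketch assumes l_pos >= l_neg >= 1")
    return [
        [((j + k - 2) % l_pos + 1, j) for j in range(1, l_neg + 1)]
        for k in range(1, l_pos + 1)
    ]


def is_proper_exact_cover(cover, l_pos, l_neg):
    """True if every block is independent and every pair is covered exactly once."""
    counts = {}
    for block in cover:
        positives = [i for i, _ in block]
        negatives = [j for _, j in block]
        # Independence: no two pairs of a block share a positive or a negative index.
        if len(set(positives)) < len(positives) or len(set(negatives)) < len(negatives):
            return False
        for pair in block:
            counts[pair] = counts.get(pair, 0) + 1
    all_pairs = {(i, j) for i in range(1, l_pos + 1) for j in range(1, l_neg + 1)}
    return set(counts) == all_pairs and all(c == 1 for c in counts.values())


cover = bipartite_cover(5, 3)
ok = is_proper_exact_cover(cover, 5, 3)
```

With unit weights, the weight of this cover is its number of blocks, $\ell^{+} = \ell_{\max}$, which matches the clique lower bound.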



We observe that in the theorem, the dependence on the skew of the sample is expressed in terms of $1/\min(\ell^{+},\ell^{-})$, whereas in the works of Agarwal et al. (2005) and Usunier et al., the bound depends on the larger $1/\ell^{+} + 1/\ell^{-}$.
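To make the role of $\ell_{\min}$ concrete, the following sketch evaluates the right-hand side of (15) and inverts the binary kl divergence by bisection to obtain an explicit upper bound on $e^{\mathrm{Auc}}_Q$. The helper names, the clipping constant and the numerical values are assumptions of this example; the value of $\mathrm{KL}(Q\|P)$ is simply supplied by the caller.

```python
from math import log


def kl_bernoulli(q, p, eps=1e-12):
    """Binary KL divergence kl(q || p), with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    value = 0.0
    if q > 0.0:
        value += q * log(q / p)
    if q < 1.0:
        value += (1.0 - q) * log((1.0 - q) / (1.0 - p))
    return value


def kl_inverse_upper(q, bound, tol=1e-10):
    """Largest p in [q, 1] such that kl(q || p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo


def auc_bound_rhs(kl_qp, l_pos, l_neg, delta):
    """Right-hand side of (15): (KL(Q||P) + ln((l_min + 1) / delta)) / l_min."""
    l_min = min(l_pos, l_neg)
    return (kl_qp + log((l_min + 1) / delta)) / l_min


# Hypothetical numbers: 100 positives, 50 negatives, KL(Q||P) = 0, delta = 0.05.
rhs = auc_bound_rhs(0.0, 100, 50, 0.05)
risk_upper_bound = kl_inverse_upper(0.1, rhs)  # for an empirical Gibbs risk of 0.1
```

Only the smaller class size enters `auc_bound_rhs`: adding examples to the larger class leaves the right-hand side unchanged.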



The PAC-Bayes bound of Theorem 14 can be specialized to the case where $h(x,x') = f(x) - f(x')$ with $f\in\{x\mapsto w\cdot x : w\in\mathcal{X}\}$: $f$ is therefore a linear scoring function and $h(x,x') = w\cdot(x - x')$. The ranking rule $h$ is thus a linear classifier acting on the difference of its arguments (the next result we present therefore carries over to kernel classifiers). As proposed by Langford (2005), we may assume an isotropic Gaussian prior $P = \mathcal{N}(0,I)$ and a family of posteriors $Q_{w,\mu}$ parameterized by $w\in\mathcal{X}$ and $\mu>0$ such that $Q_{w,\mu}$ is $\mathcal{N}(\mu,1)$ in the direction $w$ and $\mathcal{N}(0,1)$ in all perpendicular directions. Since $\mathrm{KL}(Q_{w,\mu}\|P) = \mu^2/2$ for this choice, we arrive at the following theorem:

Theorem 15 (Auc Linear PAC-Bayes bound) $\forall\ell$, $\forall D$ over $\mathcal{X}\times\mathcal{Y}$, $\forall\delta\in(0,1]$, the following holds with probability at least $1-\delta$ over the draw of $S\sim D^{\ell}$:

$$\forall w,\ \mu>0,\quad \mathrm{kl}\big(\hat{e}^{\mathrm{Auc}}_{Q_{w,\mu}}(S)\,\|\,e^{\mathrm{Auc}}_{Q_{w,\mu}}\big) \le \frac{1}{\ell_{\min}}\left[\frac{\mu^2}{2} + \ln\frac{\ell_{\min}+1}{\delta}\right].$$

Proof Straightforward from the bound of Langford (2005) and Theorem 14. ■



Note that this specific parametrization of $Q$ could have been done in Theorem 13 as well. We arbitrarily choose to provide it for this Auc-based bound as learning linear ranking rules by Auc minimization is a common approach (Ataman et al., 2006; Brefeld and Scheffer, 2005; Rakotomamonjy, 2004), and the presented result may be of practical interest (for model selection purposes, for instance) for a larger audience.

The bounds given in Theorem 14 and Theorem 15 are very similar to what we would get if applying the IID PAC-Bayes bound to one (independent) element $C_j$ of a minimal cover (i.e. its weight equals the fractional chromatic number) $\mathbf{C} = \{C_j\}_{j=1}^{n}$ such as the one we used in the proof of Theorem 14. This would imply the empirical error $\hat{e}^{\mathrm{Auc}}_Q$ to be computed on only one specific $C_j$ and not all the $C_j$'s simultaneously, as is the case for the new results. It turns out that, for proper exact fractional covers $\mathbf{C} = \{(C_j,\omega_j)\}_{j=1}^{n}$ with elements $C_j$ having the same size, it is better, in terms of absolute moments of the empirical error, to assess it on the whole dataset, rather than on only one $C_j$. The following proposition formalizes this.



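Proposition 16 $\forall\mathbf{D}_m$, $\forall\mathcal{H}$, $\forall\mathbf{C} = \{(C_j,\omega_j)\}_{j=1}^{n}\in\mathrm{PEFC}(\mathbf{D}_m)$, $\forall Q$, $\forall r\in\mathbb{N}$, $r\ge 1$, if $|C_1| = \ldots = |C_n|$ then

$$\mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_m}\big|\hat{e}_Q(\mathbf{Z}) - e_Q\big|^r \le \mathbb{E}_{\mathbf{Z}^{(j)}\sim\mathbf{D}_m^{(j)}}\big|\hat{e}_Q(\mathbf{Z}^{(j)}) - e_Q\big|^r,\quad\forall j\in\{1,\ldots,n\}.$$

Proof Using the convexity of $|\cdot|^r$ for $r\ge 1$, the linearity of $\mathbb{E}$ and the notation of Section 3, for $\mathbf{Z}\sim\mathbf{D}_m$:

$$\big|\hat{e}_Q(\mathbf{Z}) - e_Q\big|^r = \Big|\sum_j\pi_j\,\mathbb{E}_{h\sim Q}\big(\hat{R}(h,\mathbf{Z}^{(j)}) - R(h)\big)\Big|^r \le \sum_j\pi_j\big|\hat{e}_Q(\mathbf{Z}^{(j)}) - e_Q\big|^r.$$

Taking the expectation of both sides with respect to $\mathbf{Z}$ and noting that the random variables $|\hat{e}_Q(\mathbf{Z}^{(j)}) - e_Q|^r$ have the same distribution gives the result. ■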



This proposition upholds the idea of Pemmaraju (2001) to base the decomposition of a dependency graph on equitable coloring.



4.4 $\beta$-mixing Processes

Here, we provide a PAC-Bayes theorem for classifiers trained on data from a stationary $\beta$-mixing process, of which we recall some definitions, as formulated by Yu (1994) (see also, e.g., Mohri and Rostamizadeh (2009)).

Definition 17 (Stationarity) A sequence of random variables $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ is stationary if, for any $t$ and nonnegative integers $m$ and $k$, the random subsequences $(Z_t,\ldots,Z_{t+m})$ and $(Z_{t+k},\ldots,Z_{t+m+k})$ are identically distributed.

Definition 18 ($\beta$-mixing process) Let $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ be a stationary sequence of random variables. For any $i,j\in\mathbb{Z}\cup\{-\infty,+\infty\}$, let $\sigma_i^j$ denote the $\sigma$-algebra generated by the random variables $Z_k$, $i\le k\le j$. Then, for any positive integer $k$, the $\beta$-mixing coefficient $\beta(k)$ of the stochastic process $\mathbf{Z}$ is defined as

$$\beta(k) = \sup_{n\ge 1}\mathbb{E}\,\sup\big\{\big|\mathbb{P}(A\,|\,\sigma_{-\infty}^{n}) - \mathbb{P}(A)\big| : A\in\sigma_{n+k}^{+\infty}\big\}. \tag{18}$$

$\mathbf{Z}$ is said to be $\beta$-mixing if $\beta(k)\to 0$ when $k\to\infty$.



(Note there is an equivalent definition of the $\beta$-mixing coefficient based on finite partitions; see Yu (1994) for details.) Stationary $\beta$-mixing processes model a situation where the interdependence between the random variables at hand is temporal. When the process is mixing, the strength of the dependence between variables weakens over time.

The bound that we propose is in the same vein as the one proposed by Mohri and Rostamizadeh (2009), with the difference that our bound is a PAC-Bayes bound and theirs is a Rademacher-complexity-based bound. In addition to being a new type of data-dependent bound for the case of stationary $\beta$-mixing processes, we may anticipate that, in practical situations, our bound inherits the tightness of the IID PAC-Bayes bound (whereas, to the best of our knowledge, there is no evidence of such practicality for Rademacher-complexity-based bounds).

Let us state our generalization bound for classifiers trained on samples $\mathbf{Z}$ drawn from stationary $\beta$-mixing distributions.



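Theorem 19 ($\beta$-mixing process PAC-Bayes bound) Let $m$ be a positive integer. Let $\mathbf{D}$ be a stationary $\beta$-mixing distribution over $\mathcal{Z}$ and $\mathbf{D}_m$ be the distribution of $m$-samples according to $\mathbf{D}$. $\forall\mathcal{H}\subseteq\mathbb{R}^{\mathcal{X}}$, $\forall\mu,a\in\mathbb{N}$ such that $2\mu a = m$, $\forall\delta\in(2(\mu-1)\beta(a),1]$, $\forall P$, with probability at least $1-\delta$ over the random draw of $\mathbf{Z}\sim\mathbf{D}_m$, the following holds

$$\forall Q,\quad \mathrm{kl}\big(\hat{e}_Q(\mathbf{Z})\,\|\,e_Q\big) \le \frac{1}{\mu}\left[\mathrm{KL}(Q\|P) + \ln\frac{2(\mu+1)}{\delta - 2(\mu-1)\beta(a)}\right], \tag{19}$$

where

$$\hat{e}_Q(\mathbf{Z}) := \mathbb{E}_{h\sim Q}\hat{R}(h,\mathbf{Z}) = \mathbb{E}_{h\sim Q}\frac{1}{m}\sum_{t=1}^{m}\mathbb{I}_{Y_th(X_t)<0},\qquad e_Q := \mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_m}\hat{e}_Q(\mathbf{Z}).$$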



Proof The proof makes use of the independent block decomposition proposed by Yu (1994), our chromatic PAC-Bayes bound of Theorem 8 and Corollary 24 (Appendix).

The chromatic bound for independent blocks. Let $\mathbf{Z} = \{Z_1,\ldots,Z_m\}$ be the random variables we have to deal with, and let $\mu$ and $a$ be two integers such that $2\mu a = m$ (we assume that $m$ is even; if it is odd one may drop the last variable $Z_m$ and work on a sample of size $m-1$). Then $\mathbf{Z}$ can be decomposed into two subsequences $\mathbf{Z}_0$ and $\mathbf{Z}_1$ as follows:

$$\mathbf{Z}_0 := \{Z_0^s := (Z_{a(2s-2)+1},\ldots,Z_{a(2s-2)+a}) : s\in[\mu]\},$$
$$\mathbf{Z}_1 := \{Z_1^s := (Z_{a(2s-1)+1},\ldots,Z_{a(2s-1)+a}) : s\in[\mu]\}.$$

Both $\mathbf{Z}_0$ and $\mathbf{Z}_1$ are made of $\mu$ blocks of $a$ consecutive random variables. The blocks are interdependent as well as the variables within each block. $\mathbf{D}_0$ will denote the distribution of $\mathbf{Z}_0$.
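The index arithmetic of the decomposition is easy to get wrong by one, so here is a small sketch of it. The function name `split_into_blocks` is an assumption of this example; the paper's indices are 1-based, which the slices below translate to Python's 0-based convention.

```python
def split_into_blocks(z, mu, a):
    """Split a sequence of length 2*mu*a into the odd and even blocks Z_0 and Z_1.

    Block s of Z_0 holds (1-based) positions a(2s-2)+1 .. a(2s-2)+a,
    block s of Z_1 holds positions a(2s-1)+1 .. a(2s-1)+a.
    """
    if len(z) != 2 * mu * a:
        raise ValueError("expected a sequence of length 2 * mu * a")
    z0 = [list(z[a * (2 * s - 2): a * (2 * s - 2) + a]) for s in range(1, mu + 1)]
    z1 = [list(z[a * (2 * s - 1): a * (2 * s - 1) + a]) for s in range(1, mu + 1)]
    return z0, z1


# Twelve observations, mu = 2 blocks of length a = 3 in each subsequence.
z0, z1 = split_into_blocks(list(range(1, 13)), mu=2, a=3)
```

Consecutive blocks of the same subsequence are separated by a gap of $a$ observations, which is what lets the mixing coefficient $\beta(a)$ control their dependence.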

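We now define a sequence $\tilde{\mathbf{Z}}$ of independent blocks as:

$$\tilde{\mathbf{Z}} := \{\tilde{Z}^s := (\tilde{Z}^s_1,\ldots,\tilde{Z}^s_a) : s\in[\mu]\},$$

such that the blocks are mutually independent and such that each block $\tilde{Z}^s$ has the same distribution as $Z_0^s$, that is, from the stationarity assumption, the distribution of $Z_0^1$ (the blocks are IID).

The dependency graph $\tilde{\Gamma}$ of $\tilde{\mathbf{Z}}$ is such that all the variables in a block are connected to one another and such that there are no connections between blocks. Theorem 8 can readily be applied to the random sample $\tilde{\mathbf{Z}}$, whose distribution we denote $\tilde{\mathbf{D}}$: for all $P$ and $\delta\in(0,1]$,

$$\mathbb{P}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}\big(\Phi(P,\tilde{\mathbf{Z}},\delta)\big) \le \delta, \tag{20}$$

with $e_Q := \mathbb{E}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}\hat{e}_Q(\tilde{\mathbf{Z}})$ and where $\Phi(P,\tilde{\mathbf{Z}},\delta)$ is the event defined as:

$$\Phi(P,\tilde{\mathbf{Z}},\delta) := \left\{\exists Q,\ \mathrm{kl}\big(\hat{e}_Q(\tilde{\mathbf{Z}})\,\|\,e_Q\big) > \frac{1}{\mu}\left[\mathrm{KL}(Q\|P) + \ln\frac{\mu+1}{\delta}\right]\right\}.$$

To see why and how Theorem 8 can be used to get (20), observe that:

• the number of variables in $\tilde{\mathbf{Z}}$ is $\mu a$;

• by stationarity, all variables $\tilde{Z}^s_{\alpha}$, for $\alpha\in[a]$ and $s\in[\mu]$, share the same distribution: we therefore do actually work with dependent but identically distributed variables;

• the (fractional) chromatic number $\chi^*$ of $\tilde{\Gamma}$ is $a$, since

1. the clique number is $a$ (i.e. the number of variables in each block),

2. the cover $\mathbf{C}$ of $\tilde{\Gamma}$ with $\mathbf{C} := \{(C_{\alpha} := \{\tilde{Z}^1_{\alpha},\ldots,\tilde{Z}^{\mu}_{\alpha}\},1)\}_{1\le\alpha\le a}$ is a proper exact cover of size $a$.

Noting that, consequently,

$$\frac{\chi^*}{\mu a} = \frac{a}{\mu a} = \frac{1}{\mu}\quad\text{and}\quad\frac{\mu a+\chi^*}{\delta\chi^*} = \frac{\mu a + a}{\delta a} = \frac{\mu+1}{\delta}$$

gives the expression of $\Phi(P,\tilde{\mathbf{Z}},\delta)$ and (20).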



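The last two steps of the proof are similar to those used by Mohri and Rostamizadeh (2009) to establish their bound.

A bound for $\mathbf{Z}_0$. To establish the bound for $\mathbf{Z}_0$, it suffices to use Corollary 24 (Appendix) with $c(\mathbf{z})$ being defined as:

$$c(\mathbf{z}) := \mathbb{I}_{\Phi(P,\mathbf{z},\delta)},$$

which is a bounded measurable function on the blocks $Z_0^s$ (and thus on the blocks $\tilde{Z}^s$). We have:

$$\big|\mathbb{E}_{\mathbf{Z}_0\sim\mathbf{D}_0}c(\mathbf{Z}_0) - \mathbb{E}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}c(\tilde{\mathbf{Z}})\big| \le (\mu-1)\beta(a),$$

and therefore, since $\mathbb{P}_{\mathbf{Z}_0\sim\mathbf{D}_0}(\Phi(P,\mathbf{Z}_0,\delta)) = \mathbb{E}_{\mathbf{Z}_0\sim\mathbf{D}_0}c(\mathbf{Z}_0)$ and $\mathbb{P}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}(\Phi(P,\tilde{\mathbf{Z}},\delta)) = \mathbb{E}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}c(\tilde{\mathbf{Z}})$:

$$\mathbb{P}_{\mathbf{Z}_0\sim\mathbf{D}_0}\big(\Phi(P,\mathbf{Z}_0,\delta)\big) \le \mathbb{P}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}\big(\Phi(P,\tilde{\mathbf{Z}},\delta)\big) + (\mu-1)\beta(a) \le \delta + (\mu-1)\beta(a), \tag{21}$$

where the last inequality comes from (20).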

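Establishing the bound. Finally, observe that:

$$\begin{aligned}
\Phi(P,\mathbf{Z},\delta) &\Rightarrow \exists Q : \tfrac{1}{2}\mathrm{kl}\big(\hat{e}_Q(\mathbf{Z}_0)\|e_Q\big) + \tfrac{1}{2}\mathrm{kl}\big(\hat{e}_Q(\mathbf{Z}_1)\|e_Q\big) > \frac{1}{\mu}\left[\mathrm{KL}(Q\|P) + \ln\frac{\mu+1}{\delta}\right]\\
&\Rightarrow \exists Q : \bigvee_{i\in\{0,1\}}\left\{\mathrm{kl}\big(\hat{e}_Q(\mathbf{Z}_i)\|e_Q\big) > \frac{1}{\mu}\left[\mathrm{KL}(Q\|P) + \ln\frac{\mu+1}{\delta}\right]\right\}\\
&\Leftrightarrow \bigvee_{i\in\{0,1\}}\left\{\exists Q : \mathrm{kl}\big(\hat{e}_Q(\mathbf{Z}_i)\|e_Q\big) > \frac{1}{\mu}\left[\mathrm{KL}(Q\|P) + \ln\frac{\mu+1}{\delta}\right]\right\}\\
&\Leftrightarrow \Phi(P,\mathbf{Z}_0,\delta)\vee\Phi(P,\mathbf{Z}_1,\delta),
\end{aligned}$$

where we used $\hat{e}_Q(\mathbf{Z}) = \hat{e}_Q(\mathbf{Z}_0)/2 + \hat{e}_Q(\mathbf{Z}_1)/2$ and the convexity of kl in the first line. This leads to:

$$\begin{aligned}
\mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big(\Phi(P,\mathbf{Z},\delta)\big) &\le \mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big(\Phi(P,\mathbf{Z}_0,\delta)\vee\Phi(P,\mathbf{Z}_1,\delta)\big)\\
&\le \mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big(\Phi(P,\mathbf{Z}_0,\delta)\big) + \mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big(\Phi(P,\mathbf{Z}_1,\delta)\big) && \text{(union bound)}\\
&= 2\,\mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big(\Phi(P,\mathbf{Z}_0,\delta)\big) && \text{(stationarity)}\\
&= 2\,\mathbb{P}_{\mathbf{Z}_0\sim\mathbf{D}_0}\big(\Phi(P,\mathbf{Z}_0,\delta)\big) && \text{(marginalization wrt } \mathbf{Z}_0\text{)}\\
&\le 2\delta + 2(\mu-1)\beta(a). && \text{(cf. (21))}
\end{aligned}$$

Adjusting $\delta$ to $\delta/2 - (\mu-1)\beta(a)$ ends the proof. ■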
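The trade-off between the number of blocks $\mu$ and the block length $a$ can be explored numerically. The sketch below evaluates the right-hand side obtained at the end of the proof, $\frac{1}{\mu}\big[\mathrm{KL}(Q\|P) + \ln\frac{2(\mu+1)}{\delta - 2(\mu-1)\beta(a)}\big]$. The function name and the exponentially decaying mixing coefficient used in the example are assumptions made for illustration; the paper does not prescribe any particular decay.

```python
from math import exp, log


def beta_mixing_bound_rhs(kl_qp, mu, beta_a, delta):
    """Right-hand side of the beta-mixing bound for mu blocks and beta(a) = beta_a."""
    slack = delta - 2.0 * (mu - 1) * beta_a
    if slack <= 0.0:
        raise ValueError("delta must exceed 2 * (mu - 1) * beta(a)")
    return (kl_qp + log(2.0 * (mu + 1) / slack)) / mu


# Hypothetical setting: m = 10000 observations, blocks of length a = 100,
# hence mu = m / (2a) = 50, and an assumed coefficient beta(a) = exp(-a / 10).
a, mu = 100, 50
rhs_mixing = beta_mixing_bound_rhs(kl_qp=1.0, mu=mu, beta_a=exp(-a / 10.0), delta=0.05)
rhs_no_mixing_penalty = beta_mixing_bound_rhs(kl_qp=1.0, mu=mu, beta_a=0.0, delta=0.05)
```

Larger blocks make $\beta(a)$ smaller but reduce $\mu$, the effective sample size in the bound, so in practice one would scan a few admissible $(\mu,a)$ pairs.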



5. Conclusion 

In this work, we propose the first PAC-Bayes bounds applying to classifiers trained on non-IID data. The derivation of these results relies on the use of fractional covers of graphs, convexity and standard tools from probability theory. The results that we provide are very general and can easily be instantiated for specific learning settings such as ranking and learning from mixing distributions: amazingly, we obtain at a very low cost original PAC-Bayes bounds for these settings. Using a generalized PAC-Bayes bound, we provide in the appendix a chromatic PAC-Bayes bound that holds for non-independently and non-identically distributed data: it allows us to derive a PAC-Bayes bound for classifiers trained on data from a stationary $\varphi$-mixing distribution.

This work gives rise to many interesting questions. First, it seems that using a fractional 
cover to decompose the non-IID training data into sets of IID data and then tightening 
the bound through the use of the chromatic number is some form of variational relaxation 
as often encountered in the context of inference in graphical models, the graphical model 
under consideration in this work being one that encodes the dependencies in $\mathbf{D}_m$. It might
be interesting to make this connection clearer to see if, for instance, tighter and still general 
bounds can be obtained with more appropriate variational relaxations than the one incurred 
by the use of fractional covers. 

Besides, Theorem 5 advocates for the learning algorithm described in Remark 7. We would like to see how such a learning algorithm based on possibly multiple priors/multiple posteriors could perform empirically and how tight the proposed bound could be.

On another empirical side, it might be interesting to run simulations on bipartite ranking problems to see how accurate the bound of Theorem 15 can be: we expect the results to be of good quality, because of the resemblance of the bound of the theorem with the IID PAC-Bayes theorem for margin classifiers, which has proven to be rather accurate (Langford, 2005). The work of Germain et al. (2009) is another contribution that tends to support the claim that a practical use of our bounds should provide competitive results (note that Theorem 25 gives a sufficient condition for the general PAC-Bayes bound of Germain et al. (2009) to be non-degenerate). Likewise, it would be interesting to see how the possibly more accurate PAC-Bayes bound for large margin classifiers proposed by Langford and Shawe-Taylor (2002), which should translate to the case of bipartite ranking as well, performs empirically. The question also remains as to what kind of strategies to learn the prior(s) could be used to render the bound of Theorem 5 the tightest possible. This is one of the most stimulating questions, as performing such prior learning makes it possible to obtain very accurate generalization bounds (Ambroladze et al., 2007).

The connection between our ranking bounds and the theory of U-statistics makes it possible to envision the use of higher order moments in establishing PAC-Bayes bounds, thanks to Hoeffding's decomposition. We plan to investigate further in this direction, for both the ranking measures we have studied (noting that the Auc is a two-sample U-statistic (Hoeffding, 1963)).

Finally, we have been working on a more general way to establish chromatic bounds from IID bounds (covering VC, Rademacher, PAC-Bayes and, possibly, binomial tail bounds), without the need to perform 'low-level' calculations such as the ones proposed in Section 3.4. The meta-bound that we have been developing is in the spirit of that proposed by Blanchard and Fleuret (2007), except that the randomization we propose is on the subsets constituting the fractional cover (and not the hypothesis set). In other terms, given a cover $\mathbf{C} = \{(C_j,\omega_j)\}_j$, the fact that an IID bound holds on one subset $C_j$ of a cover is considered as a random event, the probability of a subset to be chosen being $\omega_j/\omega(\mathbf{C})$. A simple union bound gives our generic result, which translates into cover-independent (but fractional-chromatic-number-dependent) chromatic bounds such as (6) (Theorem 8) under very mild conditions on the shape of the base IID bound. Along with that work, we try to answer the question of establishing a principled way to handle situations where random variables show weak dependencies (as is the case for $\beta$-mixing processes): as for now, the framework described here applies when variables are either dependent or independent, disregarding the magnitude of the dependencies. Our PAC-Bayes bound for $\beta$-mixing processes would then be a specific case of such a general result.

6. Appendix

6.1 Technical Lemmas 

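Lemma 20 Let $D$ be a distribution over $\mathcal{Z}$. Then

$$\forall h\in\mathcal{H},\quad \mathbb{E}_{\mathbf{Z}\sim D^m}\,e^{m\,\mathrm{kl}(\hat{R}(h,\mathbf{Z})\|R(h))} \le m+1.$$

Proof Let $h\in\mathcal{H}$. For $\mathbf{z}\in\mathcal{Z}^m$, we let $q(\mathbf{z}) = \hat{R}(h,\mathbf{z})$; we also let $p = R(h)$. Note that since $\mathbf{Z}$ is i.i.d., $mq(\mathbf{Z})$ is binomial with parameters $m$ and $p$ (recall that $r(h,Z)$ takes the values 0 and 1 upon correct and erroneous classification of $Z$ by $h$, respectively). Hence

$$\mathbb{E}_{\mathbf{Z}\sim D^m}\,e^{m\,\mathrm{kl}(q(\mathbf{Z})\|p)} = \sum_{0\le k\le m}\mathbb{P}_{\mathbf{Z}\sim D^m}\big(mq(\mathbf{Z}) = k\big)\,e^{m\,\mathrm{kl}(\frac{k}{m}\|p)} = \sum_{0\le k\le m}\binom{m}{k}p^k(1-p)^{m-k}e^{m\,\mathrm{kl}(\frac{k}{m}\|p)} = \sum_{0\le k\le m}\binom{m}{k}\Big(\frac{k}{m}\Big)^{k}\Big(1-\frac{k}{m}\Big)^{m-k}.$$

However, it is obvious, from the definition of the binomial distribution, that

$$\forall k\in\{0,\ldots,m\},\ \forall t\in[0,1],\quad \binom{m}{k}t^k(1-t)^{m-k} \le 1.$$

This is obviously the case for $t = k/m$, which gives

$$\sum_{0\le k\le m}\binom{m}{k}\Big(\frac{k}{m}\Big)^{k}\Big(1-\frac{k}{m}\Big)^{m-k} \le \sum_{0\le k\le m}1 = m+1.\ ■$$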
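Lemma 20 lends itself to a direct numerical check, since the expectation is a finite sum over the values of a binomial variable. The sketch below computes $\mathbb{E}\,e^{m\,\mathrm{kl}(K/m\|p)}$ for $K\sim\mathrm{Binomial}(m,p)$ with $0<p<1$; the function name `kl_moment` and the chosen values of $m$ and $p$ are assumptions of this example.

```python
from math import comb, exp, log


def kl_moment(m, p):
    """E[exp(m * kl(K/m || p))] for K ~ Binomial(m, p), with 0 < p < 1."""
    total = 0.0
    for k in range(m + 1):
        q = k / m
        kl = 0.0
        if q > 0.0:
            kl += q * log(q / p)
        if q < 1.0:
            kl += (1.0 - q) * log((1.0 - q) / (1.0 - p))
        binomial_probability = comb(m, k) * p ** k * (1.0 - p) ** (m - k)
        total += binomial_probability * exp(m * kl)
    return total


moment = kl_moment(20, 0.3)  # Lemma 20 states that this is at most 20 + 1
```

As the closed form in the proof suggests, the value does not depend on $p$, which the check below also exercises.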
Theorem 21 (Jensen's inequality) Let $f\in\mathbb{R}^{\mathcal{X}}$ be a convex function. For all probability distributions $P$ on $\mathcal{X}$:

$$f(\mathbb{E}_{X\sim P}X) \le \mathbb{E}_{X\sim P}f(X).$$

Theorem 22 (Markov's Inequality) Let $X$ be a positive random variable on $\mathbb{R}$, such that $\mathbb{E}X<\infty$. Then

$$\forall t>0,\quad \mathbb{P}_X\Big\{X\ge\frac{\mathbb{E}X}{t}\Big\} \le t.$$

Consequently: $\forall M\ge\mathbb{E}X$, $\forall t>0$, $\mathbb{P}_X\{X\ge\frac{M}{t}\}\le t$.

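Lemma 23 (Convexity of kl) $\forall p,q,r,s\in[0,1]$, $\forall\alpha\in[0,1]$,

$$\mathrm{kl}\big(\alpha p + (1-\alpha)q\,\|\,\alpha r + (1-\alpha)s\big) \le \alpha\,\mathrm{kl}(p\|r) + (1-\alpha)\,\mathrm{kl}(q\|s).$$

Proof It suffices to see that $f\in\mathbb{R}^{[0,1]^2}$, $f(\mathbf{v} = [p\ q]) = \mathrm{kl}(q\|p)$ is convex over $[0,1]^2$: the Hessian $H$ of $f$ is

$$H = \begin{bmatrix}\dfrac{q}{p^2} + \dfrac{1-q}{(1-p)^2} & -\dfrac{1}{p} - \dfrac{1}{1-p}\\[2mm] -\dfrac{1}{p} - \dfrac{1}{1-p} & \dfrac{1}{q} + \dfrac{1}{1-q}\end{bmatrix},$$

and, for $q\in[0,1]$, $\frac{q}{p^2} + \frac{1-q}{(1-p)^2}\ge 0$ and $\det H = \frac{(p-q)^2}{p^2(1-p)^2q(1-q)}\ge 0$: $H\succeq 0$ and $f$ is indeed convex. ■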
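For readers who wish to verify the determinant, the computation can be spelled out as follows for $p,q\in(0,1)$. It only uses the second derivatives of $f(p,q) = \mathrm{kl}(q\|p)$ together with the identities $\frac{1}{q}+\frac{1}{1-q} = \frac{1}{q(1-q)}$ and $\frac{1}{p}+\frac{1}{1-p} = \frac{1}{p(1-p)}$; it is a routine check, not an additional argument.

```latex
\begin{align*}
\det H
  &= \left(\frac{q}{p^2}+\frac{1-q}{(1-p)^2}\right)\frac{1}{q(1-q)}
     - \frac{1}{p^2(1-p)^2} \\
  &= \frac{q(1-p)^2 + (1-q)p^2 - q(1-q)}{p^2(1-p)^2\,q(1-q)} \\
  &= \frac{p^2 - 2pq + q^2}{p^2(1-p)^2\,q(1-q)}
   = \frac{(p-q)^2}{p^2(1-p)^2\,q(1-q)} \;\ge\; 0 .
\end{align*}
```

Together with the nonnegative diagonal entry, this shows that both leading principal minors of $H$ are nonnegative on the open square.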



Finally, we have the following version by Mohri and Rostamizadeh (2009) of Corollary 2.7 in (Yu, 1994), which is based on the block decomposition used in the proof of Theorem 19:

Corollary 24 Let $c$ be a measurable function defined with respect to the blocks $Z_0^s$. If $c$ has absolute value bounded by $M$, then

$$\big|\mathbb{E}_{\mathbf{Z}_0\sim\mathbf{D}_0}c(\mathbf{Z}_0) - \mathbb{E}_{\tilde{\mathbf{Z}}\sim\tilde{\mathbf{D}}}c(\tilde{\mathbf{Z}})\big| \le (\mu-1)M\beta(a).$$


6.2 Applications of a Generic PAC-Bayes Theorem 

Let us first recall the following generic PAC-Bayes result, which is a corollary/compound of results proposed by Seeger (2002b) and McAllester (2003). In particular, the function $\Delta$ need not be differentiable with respect to its second argument and the result applies to any 'risk' functional $\psi$ for which a concentration inequality exists.

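Corollary 25 (Generic PAC-Bayes Theorem) Let $\mathcal{H}\subseteq\mathbb{R}^{\mathcal{X}}$ and $\psi:\mathcal{H}\times\bigcup_{m=1}^{\infty}\mathcal{Z}^m\to\mathbb{R}$. If there exist $\alpha\ge 1$, $\beta>1$ and a nonnegative convex function $\Delta:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+}$ that is strictly increasing with respect to its second argument such that

$$\forall h\in\mathcal{H},\ \forall\varepsilon>0,\quad \mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big[\mathbb{E}\psi(h) - \psi(h,\mathbf{Z})\ge\varepsilon\big] \le \alpha\exp\big(-\beta\Delta(\mathbb{E}\psi(h),\varepsilon)\big), \tag{22}$$

where $\mathbb{E}\psi(h)$ stands for $\mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_m}\psi(h,\mathbf{Z})$, then, $\forall P$, with probability at least $1-\delta$ over the draw of $\mathbf{Z}\sim\mathbf{D}_m$:

$$\forall Q,\quad \Delta\big(e^{\psi}_Q,\ e^{\psi}_Q - \hat{e}^{\psi}_Q(\mathbf{Z})\big) \le \frac{1}{\beta-1}\left[\mathrm{KL}(Q\|P) + \ln\frac{\alpha\beta}{\delta}\right], \tag{23}$$

where

$$\hat{e}^{\psi}_Q(\mathbf{Z}) := \mathbb{E}_{h\sim Q}\psi(h,\mathbf{Z}),\qquad e^{\psi}_Q := \mathbb{E}_{\mathbf{Z}}\hat{e}^{\psi}_Q(\mathbf{Z}) = \mathbb{E}_{h\sim Q}\mathbb{E}_{\mathbf{Z}}\psi(h,\mathbf{Z}).$$

Proof Along lines from (Seeger, 2002b) and (McAllester, 2003).

1. Observe that, thanks to Lemma 26 (below) with $\delta(\varepsilon) := \Delta(\mathbb{E}\psi(h),\varepsilon)$,

$$\forall h\in\mathcal{H},\quad \mathbb{E}_{\mathbf{Z}}\,e^{(\beta-1)\Delta(\mathbb{E}\psi(h),\,\mathbb{E}\psi(h)-\psi(h,\mathbf{Z}))} \le \alpha\beta,\quad\text{hence}\quad \mathbb{E}_{\mathbf{Z}}\mathbb{E}_{h\sim P}\,e^{(\beta-1)\Delta(\mathbb{E}\psi(h),\,\mathbb{E}\psi(h)-\psi(h,\mathbf{Z}))} \le \alpha\beta.$$

Applying Markov's inequality then gives:

$$\mathbb{P}_{\mathbf{Z}}\left[\mathbb{E}_{h\sim P}\,e^{(\beta-1)\Delta(\mathbb{E}\psi(h),\,\mathbb{E}\psi(h)-\psi(h,\mathbf{Z}))} \ge \frac{\alpha\beta}{\delta}\right] \le \delta.$$

2. Using the entropy extremal inequality $\ln\mathbb{E}_{X\sim P}f(X)\ge -\mathrm{KL}(Q\|P) + \mathbb{E}_{X\sim Q}\ln f(X)$, $\forall P,Q,f$ (see the proof of Lemma 11), and the fact that $x\mapsto\ln x$ is nondecreasing, the previous step leads to

$$\mathbb{P}_{\mathbf{Z}}\left[\exists Q : -\mathrm{KL}(Q\|P) + (\beta-1)\,\mathbb{E}_{h\sim Q}\Delta\big(\mathbb{E}\psi(h),\,\mathbb{E}\psi(h)-\psi(h,\mathbf{Z})\big) \ge \ln\frac{\alpha\beta}{\delta}\right] \le \delta.$$

3. Since $\Delta$ is convex, Jensen's inequality can be used to give (here, $h\sim Q$)

$$\mathbb{P}_{\mathbf{Z}}\left[\exists Q : -\mathrm{KL}(Q\|P) + (\beta-1)\,\Delta\big(\mathbb{E}_{h,\mathbf{Z}}\psi(h,\mathbf{Z}),\,\mathbb{E}_{h,\mathbf{Z}}\psi(h,\mathbf{Z}) - \mathbb{E}_{h}\psi(h,\mathbf{Z})\big) \ge \ln\frac{\alpha\beta}{\delta}\right] \le \delta.\ ■$$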



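Lemma 26 (McAllester (2003)) Let $X$ be a real-valued random variable on $\mathcal{X}$ and $\alpha\ge 1$, $\beta>1$. Let $\delta:\mathbb{R}\to\mathbb{R}$ be a nonnegative and strictly increasing function. We have:

$$\Big(\forall x\in\mathbb{R},\ \mathbb{P}[X\ge x]\le\alpha e^{-\beta\delta(x)}\Big)\ \Rightarrow\ \mathbb{E}\big[e^{(\beta-1)\delta(X)}\big]\le\alpha\beta.$$

Proof See the proof of McAllester (2003). Here, we take $\alpha$ into account. As $\delta$ is strictly increasing:

$$[X\ge x]\ \Leftrightarrow\ [\delta(X)\ge\delta(x)]\ \Leftrightarrow\ \big[e^{(\beta-1)\delta(X)}\ge e^{(\beta-1)\delta(x)}\big].$$

Hence: $\mathbb{P}\big[e^{(\beta-1)\delta(X)}\ge e^{(\beta-1)\delta(x)}\big]\le\alpha e^{-\beta\delta(x)}$. Setting $\nu = e^{(\beta-1)\delta(x)}$, we get:

$$\mathbb{P}\big[e^{(\beta-1)\delta(X)}\ge\nu\big]\le\min\big(1,\alpha\nu^{-\beta/(\beta-1)}\big).$$

Thus, as for a nonnegative random variable $W$, $\mathbb{E}[W] = \int_0^{\infty}\mathbb{P}[W\ge\nu]\,d\nu$:

$$\mathbb{E}\big[e^{(\beta-1)\delta(X)}\big]\le 1 + \alpha\int_1^{\infty}\nu^{-\beta/(\beta-1)}\,d\nu = 1 + \alpha(\beta-1).$$

Since $\alpha\ge 1$, $1+\alpha(\beta-1)\le\alpha\beta$, which ends the proof. ■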



We observe that:

• if $\psi(h,\mathbf{Z}) = \hat{R}(h,\mathbf{Z}) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}_{Y_ih(X_i)<0}$, then, by the one-sided Chernoff bound, $\alpha = 1$, $\beta = m$ and $\Delta(p,\varepsilon) = \mathrm{kl}(p-\varepsilon\|p)$ make equation (22) hold. The PAC-Bayes bound provided by Corollary 25 is that of Theorem 1 where $m$ is replaced by $m-1$;

• if

$$\forall i\in[m],\quad \sup_{z_1,\ldots,z_m,z_i'}\big|\psi(h,z_1,\ldots,z_m) - \psi(h,z_1,\ldots,z_{i-1},z_i',z_{i+1},\ldots,z_m)\big|\le c_i,$$

then, thanks to McDiarmid's inequality (McDiarmid, 1989), $\alpha = 1$, $\beta = 2/\sum_i c_i^2$ and $\Delta(p,\varepsilon) = \varepsilon^2$ make equation (22) hold and a PAC-Bayes bound can be derived (we let the reader write the corresponding PAC-Bayes bound);

• it suffices to have an appropriate concentration inequality for the problem at hand to have an effective PAC-Bayes bound.
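The recipe above amounts to plugging a pair $(\alpha,\beta)$ into the right-hand side $\frac{1}{\beta-1}\big[\mathrm{KL}(Q\|P)+\ln\frac{\alpha\beta}{\delta}\big]$ of (23). The following sketch does exactly that for the two instantiations listed; the function names and the numerical values are assumptions of this example, and the resulting number bounds $\Delta$, not the risk itself.

```python
from math import log


def generic_pac_bayes_rhs(kl_qp, alpha, beta, delta):
    """(KL(Q||P) + ln(alpha * beta / delta)) / (beta - 1), for alpha >= 1, beta > 1."""
    if alpha < 1.0 or beta <= 1.0 or not 0.0 < delta <= 1.0:
        raise ValueError("need alpha >= 1, beta > 1 and 0 < delta <= 1")
    return (kl_qp + log(alpha * beta / delta)) / (beta - 1.0)


def chernoff_instance(kl_qp, m, delta):
    """IID 0-1 loss: alpha = 1, beta = m; bounds kl(e_Q - eps || e_Q)."""
    return generic_pac_bayes_rhs(kl_qp, alpha=1.0, beta=float(m), delta=delta)


def mcdiarmid_instance(kl_qp, c, delta):
    """Bounded differences c_1..c_m: alpha = 1, beta = 2 / sum(c_i^2); bounds eps^2."""
    beta = 2.0 / sum(ci ** 2 for ci in c)
    return generic_pac_bayes_rhs(kl_qp, alpha=1.0, beta=beta, delta=delta)


rhs_chernoff = chernoff_instance(kl_qp=2.0, m=1000, delta=0.05)
rhs_mcdiarmid = mcdiarmid_instance(kl_qp=2.0, c=[1.0 / 1000] * 1000, delta=0.05)
```

The guard `beta > 1` matters in practice: with bounded differences that are too large, $2/\sum_i c_i^2$ drops below 1 and no bound is obtained from this corollary.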

6.2.1 Generalized Chromatic Pac-Bayes Bound 

To get a chromatic PAC-Bayes theorem for non-identically non-independently distributed data, we simply make use of the following concentration inequality of Janson (2004).

Theorem 27 (Janson (2004)) Suppose that $\mathbf{Z} = \{Z_i\}_{i=1}^{m}$ is an $m$-sample of real-valued random variables distributed according to some distribution $\mathbf{D}_m$. Suppose that each $Z_i$ has range $[a_i,b_i]$. If $S_{\mathbf{Z}} = \sum_{i=1}^{m}Z_i$, then,

$$\forall\varepsilon>0,\quad \mathbb{P}_{\mathbf{Z}}\big[\mathbb{E}S_{\mathbf{Z}} - S_{\mathbf{Z}}\ge\varepsilon\big] \le \exp\left[-\frac{2\varepsilon^2}{\chi^*(\mathbf{D}_m)\sum_{i=1}^{m}(b_i-a_i)^2}\right],$$

where $\chi^*(\mathbf{D}_m)$ is the fractional chromatic number of the dependency graph of $\mathbf{D}_m$.

Note that no assumption is made on the $Z_i$'s being identically distributed.

This concentration inequality gives rise to the following generalized chromatic PAC-Bayes bound that applies to non-independently, possibly non-identically distributed data and allows us to use any bounded loss function $r$.

Theorem 28 (Generalized Chromatic PAC-Bayes Bound) $\forall\mathbf{D}_m$, $\forall\mathcal{H}$, $\forall\delta\in(0,1]$, $\forall P$, with probability at least $1-\delta$ over the random draw of $\mathbf{Z}\sim\mathbf{D}_m$, the following holds

$$\forall Q,\quad \big|\hat{e}_Q(\mathbf{Z}) - e_Q\big|^2 \le \frac{\chi^*M^2}{2m-\chi^*M^2}\left[\mathrm{KL}(Q\|P) + \ln\frac{2m}{\delta\chi^*M^2}\right], \tag{24}$$

where $\chi^*$ stands for $\chi^*(\mathbf{D}_m)$; $r$ is a bounded function with range $M$ and

$$e_Q := \mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_m}\hat{e}_Q(\mathbf{Z}) = \mathbb{E}_{h\sim Q}\mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_m}\hat{R}(h,\mathbf{Z}),$$

with $\hat{R}(h,\mathbf{Z}) := \sum_i r(h,Z_i)/m$.

Proof It suffices to apply Corollary 25 with Theorem 27, $\alpha = 1$, $\Delta(p,\varepsilon) = \varepsilon^2$ and $\beta = 2m/(\chi^*M^2)$ (since, as $r$ has range $M$, each term $r(h,Z_i)/m$ of $\hat{R}$ has range $M/m$). ■

We notice the following.

• Here, as no assumption is made regarding the identical distribution of the $Z_i$'s, the expected risk $R(h) = \mathbb{E}_{\mathbf{Z}}\hat{R}(h,\mathbf{Z})$ does not unfold as in (3).

• In the case of using identically distributed random variables and the 0-1 loss, there is no concentration inequality that allows us to retrieve the tighter PAC-Bayes bound given in Theorem 8.

• From a more general point of view, it is enticing to try to establish even more generic results resting on the principle of graph coloring with the aim of decoupling this approach from the PAC-Bayesian framework. This is the subject of ongoing work.

6.2.2 $\varphi$-Mixing Pac-Bayes Bound

The definition of a $\varphi$-mixing process follows.

Definition 29 ($\varphi$-mixing process) Let $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ be a stationary sequence of random variables. For any $i,j\in\mathbb{Z}\cup\{-\infty,+\infty\}$, let $\sigma_i^j$ denote the $\sigma$-algebra generated by the random variables $Z_k$, $i\le k\le j$. Then, for any positive integer $k$, the $\varphi$-mixing coefficient $\varphi(k)$ of the stochastic process $\mathbf{Z}$ is defined as

$$\varphi(k) = \sup_{n,\ A\in\sigma_{n+k}^{+\infty},\ B\in\sigma_{-\infty}^{n}}\big|\mathbb{P}[A|B] - \mathbb{P}[A]\big|. \tag{25}$$

$\mathbf{Z}$ is said to be $\varphi$-mixing if $\varphi(k)\to 0$ as $k\to\infty$.

In order to establish our new PAC-Bayes bounds for stationary $\varphi$-mixing distributions, it suffices to make use of the following concentration inequality by Kontorovich and Ramanan (2008).
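Theorem 30 (Kontorovich and Ramanan (2008)) Let $\psi:\mathcal{U}^m\to\mathbb{R}$ be a function defined over a countable space $\mathcal{U}$. If $\psi$ is $l$-Lipschitz with respect to the Hamming metric for some $l>0$, then the following holds for all $t>0$:

$$\mathbb{P}_{\mathbf{Z}}\big[|\psi(\mathbf{Z}) - \mathbb{E}_{\mathbf{Z}}[\psi(\mathbf{Z})]|>t\big] \le 2\exp\left[-\frac{t^2}{2ml^2\|\Delta_m\|_{\infty}^2}\right],$$

where $\|\Delta_m\|_{\infty}\le 1+2\sum_{k=1}^{m}\varphi(k)$.

Suppose that the loss function $r$ is again such that it takes values in $[0,M]$. Then, for any $h\in\mathcal{H}$, the function $\psi(\mathbf{Z}) = \frac{1}{m}\sum_{i=1}^{m}r(h,Z_i) = \hat{R}(h,\mathbf{Z})$ is obviously $M/m$-Lipschitz.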

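Therefore, for a sample $\mathbf{Z}$ drawn according to a $\varphi$-mixing process, we have the following concentration inequality on $\hat{R}(h,\mathbf{Z})$ that holds for any $h\in\mathcal{H}$:

$$\mathbb{P}_{\mathbf{Z}\sim\mathbf{D}_m}\big[|\hat{R}(h,\mathbf{Z}) - R(h)|>t\big] \le 2\exp\left[-\frac{mt^2}{2M^2\|\Delta_m\|_{\infty}^2}\right]. \tag{26}$$

We directly get the following PAC-Bayes bound for $\varphi$-mixing processes.

Theorem 31 (PAC-Bayes bound for stationary $\varphi$-mixing processes) Let $\mathbf{D}$ be a stationary $\varphi$-mixing distribution over $\mathcal{Z}$ and $\mathbf{D}_m$ be the distribution of $m$-samples according to $\mathbf{D}$. $\forall\mathcal{H}\subseteq\mathbb{R}^{\mathcal{X}}$, $\forall\delta\in(0,1]$, $\forall P$, with probability at least $1-\delta$ over the random draw of $\mathbf{Z}\sim\mathbf{D}_m$, the following holds

$$\forall Q,\quad \big|\hat{e}_Q(\mathbf{Z}) - e_Q\big|^2 \le \frac{2M^2\|\Delta_m\|_{\infty}^2}{m-2M^2\|\Delta_m\|_{\infty}^2}\left[\mathrm{KL}(Q\|P) + \ln\frac{m}{M^2\|\Delta_m\|_{\infty}^2} + \ln\frac{1}{\delta}\right],$$

where $\|\Delta_m\|_{\infty}\le 1+2\sum_{k=1}^{m}\varphi(k)$, $r(h,Z) = \mathbb{I}_{Yh(X)<0}$ and

$$\hat{e}_Q(\mathbf{Z}) := \mathbb{E}_{h\sim Q}\hat{R}(h,\mathbf{Z}) = \mathbb{E}_{h\sim Q}\frac{1}{m}\sum_{t=1}^{m}\mathbb{I}_{Y_th(X_t)<0},\qquad e_Q := \mathbb{E}_{\mathbf{Z}\sim\mathbf{D}_m}\hat{e}_Q(\mathbf{Z}).$$

Proof Equation (26) and Corollary 25 with $\alpha = 2$, $\beta = m/(2M^2\|\Delta_m\|_{\infty}^2)$, $\Delta(p,\varepsilon) = \varepsilon^2$. ■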
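The parameters given in the proof can be turned into a number as follows. This is a sketch under stated assumptions: the function name is hypothetical, the mixing coefficients are supplied by the caller as a function `phi`, and $\|\Delta_m\|_{\infty}$ is replaced by its upper bound $1+2\sum_{k=1}^{m}\varphi(k)$, which only weakens the concentration inequality and therefore keeps the bound valid.

```python
from math import log


def phi_mixing_bound_rhs(kl_qp, m, loss_range, phi, delta):
    """Bound on the squared deviation |e_Q(Z) - e_Q|^2 from Corollary 25 with
    alpha = 2 and beta = m / (2 * M^2 * norm^2), where norm = 1 + 2 * sum(phi(k))."""
    norm = 1.0 + 2.0 * sum(phi(k) for k in range(1, m + 1))
    beta = m / (2.0 * loss_range ** 2 * norm ** 2)
    if beta <= 1.0:
        raise ValueError("beta <= 1: the sample is too small for this bound")
    alpha = 2.0
    return (kl_qp + log(alpha * beta / delta)) / (beta - 1.0)


# Hypothetical geometric decay phi(k) = 0.5 ** k and the 0-1 loss (M = 1).
rhs_phi = phi_mixing_bound_rhs(1.0, 1000, 1.0, lambda k: 0.5 ** k, 0.05)
rhs_independent = phi_mixing_bound_rhs(1.0, 1000, 1.0, lambda k: 0.0, 0.05)
```

With slowly decaying coefficients the factor $\|\Delta_m\|_{\infty}^2$ can grow quickly and push $\beta$ below 1, in which case the function raises instead of returning a vacuous value.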
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